I would suggest that you analyze the URLs of your next crawl and then just move those "bad" hosts/domains to an adult index. The next time you will not have to recrawl those URLs, or only in very long cycles. Or just run a Perl script over them to check whether they throw any error messages like 404 etc. Those URLs could still be interesting for the link graph... do you want a domain with many adult incoming links? I think the first check and crawl will be much slower, but you can start filtering your index right now with queries like "inurl:sexmovies" or whatever. Or run the regexps as queries over your index and move all hosts which come up for several adult queries. Cleaning is very time consuming, especially spam cleaning and erotic filtering... ask all those people working for the major search engines...
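As a rough illustration of that 404 check (in Python rather than Perl, with the input file name and timeout made up for the example), something like this could split a URL list into live and dead entries:

import urllib.request
from urllib.error import HTTPError, URLError

# Placeholder input file: one URL per line.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

live, dead = [], []
for url in urls:
    try:
        # HEAD keeps the check cheap; some servers reject it, in which case a GET fallback would be needed.
        req = urllib.request.Request(url, method="HEAD")
        urllib.request.urlopen(req, timeout=10)
        live.append(url)
    except HTTPError as e:   # 404, 500, and other HTTP error codes
        dead.append((url, e.code))
    except URLError:         # DNS failures, refused connections, timeouts
        dead.append((url, None))

print(f"{len(live)} live URLs, {len(dead)} dead URLs")

The dead entries can then be dropped from the recrawl, while their hosts may still be worth keeping for the link graph as described above.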
-----Original Message-----
From: Webmaster [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 22, 2008 08:31
To: [email protected]
Subject: RE: Extensive web crawl

I'm thinking I do not want any adult content at all in my system. There's more than enough of that out there on other engines. Preferably I think I would like to use regex to filter the content while crawling and then use an additional filter combination within the search server itself to ensure that no adult content gets through. Now, will adding these 1000+ terms to the regex slow down the parsing of the URLs excessively?

I think when this next crawl is finished I'll see about trying the filtering..

Thanks..

Axel..

-----Original Message-----
From: Höchstötter Nadine [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2008 2:20 AM
To: [email protected]
Subject: RE: Extensive web crawl

Hi, sorry, I think my other mail was too large. I have a regexp for such trigger terms; I used it to filter "very" clean queries. But I think it is better to crawl http://www.sex-lexis.com/ and extract the categories as trigger terms.

Cheers,
Nadine.

-----Original Message-----
From: Webmaster [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2008 00:09
To: [email protected]
Subject: RE: Extensive web crawl

Hi Otis,

So far so good.. I have things debugged and have 2TB of disk space to use now for building the index. Currently Nutch is fetching about 12 million URLs a day and the index is at about 250 million URLs. Each fetch round takes about 2 hours for 1-1.5 million URLs. It will be a few more weeks yet before I hit the 1 billion mark.

So far I have one searchable index built locally for testing that contains 100m URLs and it seems to work quite well and reasonably fast on a single-processor P4 2.7GHz with 1.5GB RAM. In 2 weeks I'll be halfway there and have a 500m URL index. The next round of fetching should be done tomorrow with another 100m links.

I must say I am very impressed with Hadoop and the ease-of-use factor. I never thought it would be so easy to add and remove nodes at will. It makes things pretty simple for upgrading and changing things around. I can add nodes on the fly while processes are running and it just keeps ticking along.

I gave up on using the DMOZ for injections though. Too many bad URLs and dead sites in the index cause all kinds of errors. I have since just written a PHP script that I can point at a few sites here and there, and it scrapes all the URLs into a nice list that I can cut and paste to a bigger list. By modifying the crawl-urlfilter.txt I have been able to get most crawls to return 10m+ URLs from a list of 100 injected sites crawling to a depth of 40 (sites with lots of outlinks).

When this next round of fetching is done I'm going to inject 10m valid URLs from my fresh fetch lists and crawl to a depth of 10 to see what happens. My guess is it will return about 200m URLs; this should be an adequate stress test of my sad cluster of outdated machines :)

I am however still looking into filtering the results for adult content before I move it off the Hadoop cluster and put it on the distributed search nodes' local file systems. If all goes well I might put the sandbox up live for beta testing..

Axel..
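On the question above about whether 1000+ trigger terms will slow the regex filtering down excessively: one common trick is to fold all the terms into a single alternation and compile it once, so each URL is scanned in one pass instead of once per term. A minimal Python sketch (the file names are placeholders, and a real term list would come from something like the sex-lexis categories mentioned above):

import re

# Placeholder term file: one trigger term per line.
with open("adult_terms.txt") as f:
    terms = [line.strip() for line in f if line.strip()]

# Escape each term and fold the whole list into one compiled pattern.
adult_pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

def is_adult(url: str) -> bool:
    return adult_pattern.search(url) is not None

# Placeholder fetch list: keep only the URLs that match no trigger term.
kept = []
with open("fetchlist.txt") as f:
    for line in f:
        url = line.strip()
        if url and not is_adult(url):
            kept.append(url)

print(f"kept {len(kept)} URLs")

The same terms could presumably also be turned into "-" rules in Nutch's regex-urlfilter.txt, but a single compiled alternation applied as a post-processing step should be cheap compared to the fetching itself.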
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Monday, October 20, 2008 8:38 AM
To: [email protected]
Subject: Re: Extensive web crawl

Axel, how did this go? I'd love to know if you got to 1B.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Webmaster <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, October 7, 2008 1:13:29 AM
> Subject: Extensive web crawl
>
> Ok..
>
> So I want to index the web.. All of it..
>
> Any thoughts on how to automate this so I can just point the spider off on
> its merry way and have it return 20 billion pages?
>
> So far I've been injecting random portions of the DMOZ mixed with other URLs
> like directory.yahoo.com and wiki.org. I was hoping this would give me a
> good return with an unrestricted URL filter where MY.DOMAIN.COM was replaced
> with *.* -- Perhaps this is my error and that should be left as is and the
> last line should be +. instead of -. ?
>
> Anyhow, after injecting 2000 URLs and a few of my own I still only get back
> minimal results in the range of 500 to 600k URLs.
>
> Right now I have a new crawl going with 1 million injected URLs from the
> DMOZ; I'm thinking that this should return a 20 million page index at
> least.. No?
>
> Anyhow.. I have more HD space on the way and would like to get the indexing
> up to 1 billion by the end of the week..
>
> Any examples on how to set up the url-filter.txt and regex-filter.txt would
> be helpful..
>
> Thanks..
>
> Axel..
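On the url-filter question in the quoted message: the stock Nutch filter ends with "-." so that anything not explicitly allowed is rejected, which is why an unrestricted crawl needs that final rule flipped to "+." (and the MY.DOMAIN accept line removed). A rough sketch of what an opened-up crawl-urlfilter.txt / regex-urlfilter.txt could look like (the suffix list is abbreviated, so treat it as an outline rather than a drop-in file):

# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# skip images and other binary suffixes (abbreviated here)
-\.(gif|GIF|jpg|JPG|png|PNG|css|zip|exe|mov|mpg)$
# skip URLs containing characters that usually mean queries or session junk
-[?*!@=]
# no +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ line, since the crawl is unrestricted
# accept everything else
+.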
