> I believe a more restricted URL crawling control for
> Nutch is necessary. I'd like to see it as a future
> feature for Nutch.
>
> Nutch is ideal for controlled-domain crawling. Most
> Nutch hosts don't have the resources Google has.
Exactly — we have a limited amount of crawling capacity, and that's fine. The key is to prune out 95% of the pages by carefully controlling the crawl. As it is, my crawls are filling up with Wikipedia in Afrikaans, domain names that are bare IP addresses, and other content I want absolutely none of.

Most of this could be fixed by a simple URL filter so that the Nutch tools never visit certain unwanted URLs. My list would include:

1. Any numeric domain name
2. Any website not running on port 80
3. Any of the Wikipedias that aren't en.wikipedia.org

plus a few other rules that could be set by hand as the crawl progresses. With that in place, even a small Nutch crawl could provide a valuable search. As it is, my segments are full of junk, and I'm not sure how to clear it out.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
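For what it's worth, the three rules above could be sketched as entries for the urlfilter-regex plugin's regex-urlfilter.txt (rules are tried top to bottom, and the first matching +/- pattern decides). The exact patterns below are my own untested guesses, and the lookahead in the Wikipedia rule assumes the plugin's regex engine supports it:

```
# Skip hosts that are bare IP addresses (rule 1)
-^https?://\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

# Skip any URL with an explicit port in the host part (rule 2).
# Note this would also reject an explicit ":80", which is rare anyway.
-^https?://[^/:]+:\d+

# Skip every Wikipedia except en.wikipedia.org (rule 3)
-^https?://(?!en\.)[a-z0-9-]+\.wikipedia\.org

# Accept everything else
+.
```

Hand-tuned rules as the crawl progresses would just be more `-` lines added above the final `+.` catch-all.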
