> I believe a more restricted URL crawling control for
> Nutch is necessary. I'd like to see it as a future
> feature for Nutch.
> 
> Nutch is ideal for controlled domain crawling. Most
> Nutch hosts don't have the resources Google has.

Exactly, we have a limited amount of crawling
capacity.  That's fine.  The key is to prune out 95%
of the pages by carefully controlling the crawl.  As it
is, my crawls are getting filled up with Wikipedia in
Afrikaans, domain names that are bare IP addresses, and
other content that I want absolutely none of.  Most of
this could be fixed by a simple URL filter so that the
Nutch tools never fetch certain unwanted URLs.  I
would put on the list:

1. Any numeric domain name
2. Any website not running on port 80
3. Any of the Wikipedias that aren't en.wikipedia.org

and probably a few other rules that could be set by
hand as the crawl progresses.  By doing that, even a
small Nutch crawl could provide a valuable search.
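The three rules above are easy to express as a URL predicate.  Here is a
minimal sketch (in Python, just to illustrate the logic; in Nutch itself
this kind of rule would normally go into the URL-filter plugin
configuration, e.g. regex-based filter rules).  The example URLs are
made up for illustration:

```python
import re
from urllib.parse import urlparse

def allow_url(url):
    """Return True if the URL passes the three hand-written rules."""
    parts = urlparse(url)
    host = parts.hostname or ""

    # Rule 1: reject numeric domain names (bare IP addresses).
    if re.fullmatch(r"[0-9.]+", host):
        return False

    # Rule 2: reject any site not served on port 80
    # (no explicit port counts as the default, port 80).
    if parts.port is not None and parts.port != 80:
        return False

    # Rule 3: reject Wikipedia hosts other than en.wikipedia.org.
    if host.endswith("wikipedia.org") and host != "en.wikipedia.org":
        return False

    return True

print(allow_url("http://en.wikipedia.org/wiki/Nutch"))  # True
print(allow_url("http://af.wikipedia.org/"))            # False
print(allow_url("http://192.0.2.10/index.html"))        # False
print(allow_url("http://example.com:8080/"))            # False
```

A hand-maintained list like this stays small because each rule knocks
out a whole class of hosts rather than individual URLs.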

As it is, my segments are full of junk, and I'm not
sure how to clear it out.


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
