> I believe a more restricted URL crawling control for
> Nutch is necessary. I'd like to see it as a future
> feature for Nutch. 
> 
> Nutch is ideal for controlled domain crawling. Most
> Nutch hosts don't have the resources that Google has.

Exactly, we have a limited amount of crawling
capacity.  That's fine.  The key is to prune out 95%
of the pages by carefully controlling crawling.  As it
is, my crawls are getting filled up by Wikipedia in
Afrikaans, domain names that are IP addresses, and
other content that I want absolutely none of.  Most of
this could be fixed by a simple URL filter so that
Nutch tools never fetch certain unwanted URLs.  I
would put on the list:

1. Any numeric domain name
2. Any website not running on port 80
3. Any of the Wikipedias that aren't en.wikipedia.org

and probably a few other rules that could be set by
hand as the crawl progresses.  By doing that, even a
small Nutch crawl could provide a valuable search.
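
To make the three rules concrete, here is a rough sketch of the kind
of check I have in mind.  It is plain Python, not Nutch's own filter
mechanism, and the function name keep_url is made up for
illustration.  I'm assuming "numeric domain name" means a dotted IPv4
address, and that a URL with no explicit port counts as port 80.

```python
import re
from urllib.parse import urlsplit

# Assumption: "numeric domain name" means a dotted IPv4 address.
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def keep_url(url):
    """Return True if the URL passes the three hand-written rules."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host:
        return False

    # Rule 1: drop any numeric domain name.
    if IPV4_HOST.match(host):
        return False

    # Rule 2: drop anything with an explicit port other than 80.
    # Assumption: no explicit port is treated as port 80.
    try:
        port = parts.port
    except ValueError:
        return False  # malformed port
    if port is not None and port != 80:
        return False

    # Rule 3: drop every Wikipedia host except en.wikipedia.org.
    if host.endswith(".wikipedia.org") and host != "en.wikipedia.org":
        return False

    return True
```

Further rules added by hand during the crawl would just be more
checks of the same shape before the final return.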

As it is, my segments are full of junk, and I'm not
sure how to clear it out.

