I have a simple question: I'm using Nutch to do some
whole-web crawling (just a small dataset).  Somehow
Nutch has picked up a lot of URLs from af.wikipedia.org
in its segments, and when I generate another
segment (using -topN 20000) it wants to crawl a bunch
more URLs from af.wikipedia.org.  I don't want to
crawl any of the Afrikaans Wikipedia.  Is there a way
to block that?  I also want to block it from ever
crawling hosts like 33.44.55.66 (bare IP addresses),
because those are usually very badly configured
servers with worthless content.

I tried putting those patterns into the
crawl-urlfilter.txt and banned-hosts.txt files, but
the fetch command doesn't seem to pay any attention
to them.
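
For reference, the exclusion rules I tried look
roughly like this (using the standard regex filter
syntax, where a leading - excludes matching URLs and
a leading + includes them; the exact patterns below
are a sketch of what I had, not a verbatim copy):

```
# exclude everything from the Afrikaans Wikipedia
-^http://af\.wikipedia\.org/

# exclude hosts that are bare IP addresses
-^http://[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?/

# accept anything else
+.
```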

Should I be using crawl instead of fetch?
