I have a simple question: I'm using Nutch to do some whole-web crawling (just a small dataset). Somehow Nutch has picked up a lot of URLs from af.wikipedia.org into its segments, and when I generate another segment (using -topN 20000) it wants to crawl a bunch more URLs from af.wikipedia.org. I don't want to crawl the Afrikaans Wikipedia at all. Is there a way to block it? I'd also like to block hosts that are bare IP addresses, like 33.44.55.66, since those are usually very badly configured servers with worthless content.
I tried putting those patterns into the crawl-urlfilter.txt and banned-hosts.txt files, but the fetch command doesn't seem to pay attention to either file. Should I be using crawl instead of fetch?
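For reference, here is a sketch of the kind of exclusion rules that could go into Nutch's regex-based URL filter file (crawl-urlfilter.txt is read by the crawl command; the standalone tools typically read regex-urlfilter.txt instead, though the exact filenames and behavior depend on the Nutch version). Rules are applied top to bottom; the first matching pattern wins, with `-` meaning reject and `+` meaning accept:

```
# Reject anything from the Afrikaans Wikipedia
-^http://af\.wikipedia\.org/

# Reject hosts that are bare IP addresses, e.g. http://33.44.55.66/
-^http://[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+

# Accept everything else
+.
```

Note that URL filters are generally applied when URLs are injected or when a new segment is generated, not retroactively to URLs already sitting in existing segments, so already-fetched af.wikipedia.org pages would remain until the database is pruned or rebuilt.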
