Hello Joe,

If you are doing whole-web crawling, you should edit regex-urlfilter.txt instead of crawl-urlfilter.txt.
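For example, to skip anything from af.wikipedia.org and any host that is just a bare IP address, you could add rules along these lines near the top of regex-urlfilter.txt (a rough sketch from memory; check the exact patterns against the default file shipped with your Nutch version):

  # skip the Afrikaans Wikipedia entirely
  -^http://af\.wikipedia\.org/

  # skip hosts that are bare IP addresses, e.g. http://33.44.55.66/
  -^http://\d+\.\d+\.\d+\.\d+

Keep the catch-all accept rule (+. in the default file) at the end so everything else still gets through. If I remember right, the filters are applied when the fetchlist is generated, so you will need to regenerate the segment after editing the file.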
Piotr

On 7/28/05, Vacuum Joe <[EMAIL PROTECTED]> wrote:
> I have a simple question: I'm using Nutch to do some
> whole-web crawling (just a small dataset). Somehow
> Nutch has gotten a lot of URLs from af.wikipedia.org
> into its segments, and when I generate another
> segment (using -topN 20000) it wants to crawl a bunch
> more URLs from af.wikipedia.org. I don't want to
> crawl any of the Afrikaans Wikipedia. Is there a way
> to block that? Also, I want to block it from ever
> crawling domains like 33.44.55.66, because those are
> usually very badly configured servers with worthless
> content.
>
> I tried to put those things into the crawl-urlfilter.txt
> file and the banned-hosts.txt file, but it seems that
> the fetch command doesn't pay attention to those two
> files.
>
> Should I be using crawl instead of fetch?
