Hello Joe,

If you are doing whole-web crawling, you should edit regex-urlfilter.txt instead of crawl-urlfilter.txt.
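For example, to skip anything from af.wikipedia.org and any host that is just a bare IP address, you could add rules along these lines near the top of regex-urlfilter.txt (a rough sketch from memory; check the exact patterns against the default file shipped with your Nutch version):

  # skip the Afrikaans Wikipedia entirely
  -^http://af\.wikipedia\.org/

  # skip hosts that are bare IP addresses, e.g. http://33.44.55.66/
  -^http://\d+\.\d+\.\d+\.\d+

Keep the catch-all accept rule (+. in the default file) at the end so everything else still gets through. If I remember right, the filters are applied when the fetchlist is generated, so you will need to regenerate the segment after editing the file.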
Piotr

On 7/28/05, Vacuum Joe <[EMAIL PROTECTED]> wrote:
> I have a simple question: I'm using Nutch to do some
> whole-web crawling (just a small dataset). Somehow
> Nutch has gotten a lot of URLs from af.wikipedia.org
> into its segments, and when I generate another
> segment (using -topN 20000) it wants to crawl a bunch
> more URLs from af.wikipedia.org. I don't want to
> crawl any of the Afrikaans Wikipedia. Is there a way
> to block that? Also, I want to block it from ever
> crawling domains like 33.44.55.66, because those are
> usually very badly configured servers with worthless
> content.
>
> I tried to put those things into the crawl-urlfilter.txt
> file and the banned-hosts.txt file, but it seems that
> the fetch command doesn't pay attention to those two
> files.
>
> Should I be using crawl instead of fetch?
