> Hello Joe,
> If you are doing whole-web crawling, you should
> change regex-urlfilter.txt
> instead of crawl-urlfilter.txt.

Hi Piotr,

Thanks for the tip.  I tried that.  I put:

-^http://af.wikipedia.org/

in both regex-urlfilter.txt and crawl-urlfilter.txt. 
I even put in a bogus entry for af.wikipedia.org in my
/etc/hosts, and yet when I run a fetch using

nutch fetch segments/244444444

it still fetches from af.wikipedia.org.  About one
third of my segment data is in Afrikaans and of no
value to me.  Is there any other way to do this?
I'm thinking of putting a rule in the firewall to
block traffic to that IP address.  But surely there's
some way to tell the fetcher "never ever go to this
server"?  That seems like a very important thing to
have, because a) some servers have undesirable content
and b) some servers have "spider trap" content that
will suck in the whole fetch.  Any ideas?
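
To make clear what I expect the rule to do, here is a small Python sketch of how I understand the +/- lines in a urlfilter file to be evaluated.  This is my assumption about the semantics, not Nutch's actual code: rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL that matches nothing is rejected.  The names parse_rules, accepts and RULES are made up for the illustration, and the IP-address pattern is only my guess at a rule for hosts like 33.44.55.66.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL
    decides; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Hypothetical rule file. The exclusions come before the catch-all '+.',
# because under first-match-wins a leading '+.' would accept everything.
RULES = parse_rules(r"""
# skip the Afrikaans Wikipedia (dots escaped so they match literally)
-^http://af\.wikipedia\.org/
# skip hosts given as a bare IPv4 address, e.g. 33.44.55.66
-^http://\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d+)?(/|$)
# accept everything else
+.
""")

if __name__ == "__main__":
    for url in (
        "http://af.wikipedia.org/wiki/Tuisblad",
        "http://en.wikipedia.org/wiki/Main_Page",
        "http://33.44.55.66/index.html",
    ):
        print(url, "->", "accept" if accepts(RULES, url) else "reject")
```

If the rule works the way this sketch assumes, nothing under af.wikipedia.org should ever pass the filter.  The sketch only covers how a single URL is matched.  It says nothing about which Nutch commands actually consult the filter file.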

Thanks

> On 7/28/05, Vacuum Joe <[EMAIL PROTECTED]> wrote:
> > I have a simple question: I'm using Nutch to do some
> > whole-web crawling (just a small dataset).  Somehow
> > Nutch has gotten a lot of URLs from af.wikipedia.org
> > into its segments, and when I generate another
> > segment (using -topN 20000) it wants to crawl a bunch
> > more URLs from af.wikipedia.org.  I don't want to
> > crawl any of the Afrikaans Wikipedia.  Is there a way
> > to block that?  Also, I want to block it from ever
> > crawling hosts identified only by an IP address, like
> > 33.44.55.66, because those are usually very badly
> > configured servers with worthless content.
> >
> > I tried to put those things into the
> > crawl-urlfilter.txt file and the banned-hosts.txt
> > file, but it seems that the fetch command doesn't pay
> > attention to those two files.
> >
> > Should I be using crawl instead of fetch?

