Did you use the bin/nutch crawl script? I tried it once: for an intranet crawl, if you put a domain restriction in crawl-urlfilter.txt, it only fetches pages within that domain.
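For reference, the domain restriction Michael describes would look roughly like this in conf/crawl-urlfilter.txt (a sketch based on the stock file Nutch ships with; MY.DOMAIN.NAME is the placeholder you replace with your own domain):

```
# skip URLs containing characters that are probable queries or session ids
-[?*!@=]
# accept only hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# reject everything else
-.
```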
Michael

--- Vacuum Joe <[EMAIL PROTECTED]> wrote:
> > Hello Joe,
> >
> > If you are using whole web crawling you should
> > change regex-urlfilter.txt instead of
> > crawl-urlfilter.txt.
>
> Hi Piotr,
>
> Thanks for the tip. I tried that. I put:
>
> -^http://af.wikipedia.org/
>
> in both regex-urlfilter.txt and crawl-urlfilter.txt.
> I even put in a bogus entry for af.wikipedia.org in
> my /etc/hosts, and yet when I run a fetch using
>
> nutch fetch segments/244444444
>
> it is still fetching from af.wikipedia.org, and about
> one third of my segment data is in Afrikaans, and of
> no value to me. Is there any other way to do this?
> I'm thinking of putting a rule in the firewall to
> block traffic to that IP address. But surely there's
> some way to tell Fetch "never ever go to this
> server"? That seems like a very important thing to
> have, because a) some servers have undesirable
> content and b) some servers have "spider trap"
> content that will suck in the whole fetch. Any ideas?
>
> Thanks
>
> On 7/28/05, Vacuum Joe <[EMAIL PROTECTED]> wrote:
> > I have a simple question: I'm using Nutch to do
> > some whole-web crawling (just a small dataset).
> > Somehow Nutch has gotten a lot of URLs from
> > af.wikipedia.org into its segments, and when I
> > generate another segment (using -topN 20000) it
> > wants to crawl a bunch more URLs from
> > af.wikipedia.org. I don't want to crawl any of the
> > Afrikaans Wikipedia. Is there a way to block that?
> > Also, I want to block it from ever crawling domains
> > like 33.44.55.66, because those are usually very
> > badly configured servers with worthless content.
> >
> > I tried to put those things into the
> > crawl-urlfilter.txt file and the banned-hosts.txt
> > file, but it seems that the fetch command doesn't
> > pay attention to those two files.
> >
> > Should I be using crawl instead of fetch?
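One gotcha with both filter files: the rules are applied top to bottom and the first matching rule decides, so a `-^http://af.wikipedia.org/` line only takes effect if it appears before any `+` rule that also matches the URL. A minimal Python sketch of that first-match-wins semantics (my own illustration of how the rule list behaves, not Nutch's actual code):

```python
import re

# Each rule is '+' (accept) or '-' (reject) plus a regex, as in
# regex-urlfilter.txt. The first rule whose regex matches the URL
# decides; later rules are never consulted.
RULES = [
    ("-", r"^http://af\.wikipedia\.org/"),  # reject the Afrikaans Wikipedia
    ("-", r"^http://\d+\.\d+\.\d+\.\d+"),   # reject bare-IP hosts like 33.44.55.66
    ("+", r"."),                            # accept everything else
]

def accepts(url):
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is filtered out

print(accepts("http://af.wikipedia.org/wiki/Tuisblad"))  # False
print(accepts("http://33.44.55.66/index.html"))          # False
print(accepts("http://en.wikipedia.org/wiki/Nutch"))     # True
```

If the `+.` accept-everything rule were listed first, every URL would match it immediately and the two `-` rules would never fire, which is one way a reject rule can silently have no effect.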
