Agreed, I believe a more restricted URL crawling control for Nutch is necessary. I'd like to see it as a future feature for Nutch.
Nutch is ideal for controlled domain crawling. Most Nutch hosts don't have the resources Google has.

Michael

--- Vacuum Joe <[EMAIL PROTECTED]> wrote:
> Hello Andy,
>
> > What are all the rules in your regex-urlfilter.txt file? When a URL
> > is checked against the RegexURLFilter, the URL is checked against
> > each regex expression iteratively. If it hits a regex rule that
> > matches, it will either reject or accept the URL, depending on the
> > + or - in front of the rule.
> >
> > So it may be possible that you have a rule which accepts your
> > af.wikipedia.org URLs before it even processes the
> > -^http://af.wikipedia.org/ rule.
>
> There's nothing. They are all negative rules until the end of the
> file, which ends with a "+." to allow everything that wasn't denied.
>
> > By the way, the RegexURLFilter class has a main method where you can
> > feed in URLs via stdin for testing purposes. This has been very
> > useful for me in the past.
>
> I'll give it a try.
>
> By the way, some more info:
>
> I deleted the old DB and recreated it before doing the next fetch;
> there was no improvement.
>
> Being unable to block Fetch from fetching certain URLs seems like a
> fatal shortcoming in Nutch. I want it to NEVER EVER crawl sites where
> the URLs are 222.111.44.33 kind of URLs, and I also want it to NEVER
> EVER crawl sites that are on port 8080, for example.
>
> I have found that if the site operators don't have enough resources
> to get a domain name and run it on port 80, the site is probably
> worthless. And yet there seems to be no way to get Nutch not to crawl
> these sites.
>
> Also, there are some sites out there that are spider-traps, like
> non-English Wikis, which are full of undesired content, yet Fetch
> goes through them all.
>
> There really needs to be a file somewhere that all the Nutch fetching
> and crawling utilities consult before they fetch a URL. I'm almost
> thinking of doing this with some kind of web proxy as a firewall,
> because it seems to be a big problem in the fetch commands.
>
> If I submitted a patch to implement a URL filter for the Fetch
> command, would Nutch put it in the next release?
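
For what it's worth, the first-match behaviour Andy describes means the ordering in regex-urlfilter.txt already gives a fair amount of this control, as long as the reject rules come before the final "+." catch-all. A rough sketch of the kind of rules being asked for (the patterns below are illustrative only, not taken from a tested config):

    # reject hosts given as raw IP addresses, e.g. http://222.111.44.33/
    -^http://\d+\.\d+\.\d+\.\d+
    # reject URLs that carry an explicit port, e.g. http://example.com:8080/
    -^http://[^/]+:\d+
    # reject a known spider-trap wiki
    -^http://af\.wikipedia\.org/
    # accept everything that was not rejected above
    +.

And since the RegexURLFilter main method reads URLs from stdin, rules like these can be sanity-checked before a crawl with something along the lines of

    echo "http://222.111.44.33:8080/" | bin/nutch org.apache.nutch.net.RegexURLFilter

(the exact class name and invocation vary between Nutch versions), which should report whether each URL passes the filter.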
