Agreed, I believe a more restricted URL crawling control for Nutch is necessary. I'd like to see it as a future feature for Nutch.
Nutch is ideal for controlled domain crawling. Most Nutch hosts don't have the resources Google has.

Michael

--- Vacuum Joe <[EMAIL PROTECTED]> wrote:
> Hello Andy,
>
> > What are all the rules in your regex-urlfilter.txt file? When a URL
> > is checked against the RegexURLFilter, the URL is checked against
> > each regex expression iteratively. If it hits a regex rule that
> > matches, it will either reject or accept the URL, depending on the
> > + or - in front of the rule.
> >
> > So it may be possible that you have a rule which accepts your
> > af.wikipedia.org URLs before it even processes the
> > -^http://af.wikipedia.org/ rule.
>
> There's nothing. They are all negative rules until the end of the
> file, which ends with a "+." to allow everything that wasn't denied.
>
> > By the way, the RegexURLFilter class has a main method where you can
> > feed in URLs via stdin for testing purposes. This has been very
> > useful for me in the past.
>
> I'll give it a try.
>
> By the way, some more info:
>
> I deleted the old DB and recreated it before doing the next fetch;
> there was no improvement.
>
> Being unable to block Fetch from fetching certain URLs seems like a
> fatal shortcoming in Nutch. I want it to NEVER EVER crawl sites where
> the URLs are 222.111.44.33 kind of URLs, and I also want it to NEVER
> EVER crawl sites that are on port 8080, for example.
>
> I have found that if the site operators don't have enough resources
> to get a domain name and run it on port 80, the site is probably
> worthless. And yet there seems to be no way to get Nutch not to crawl
> these sites.
>
> Also, there are some sites out there that are spider-traps, like
> non-English Wikis, which are full of undesired content, yet Fetch
> goes through them all.
>
> There really needs to be a file somewhere that all the Nutch fetching
> and crawling utilities consult before they fetch a URL. I'm almost
> thinking of doing this with some kind of web proxy as a firewall,
> because it seems to be a big problem in the fetch commands.
>
> If I submitted a patch to implement a URL filter for the Fetch
> command, would Nutch put it in the next release?
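
For what it's worth, the first-match behaviour Andy describes means the ordering in regex-urlfilter.txt already gives a fair amount of this control, as long as the reject rules come before the final "+." catch-all. A rough sketch of the kind of rules being asked for (the patterns below are illustrative only, not taken from a tested config):

    # reject hosts given as raw IP addresses, e.g. http://222.111.44.33/
    -^http://\d+\.\d+\.\d+\.\d+
    # reject URLs that carry an explicit port, e.g. http://example.com:8080/
    -^http://[^/]+:\d+
    # reject a known spider-trap wiki
    -^http://af\.wikipedia\.org/
    # accept everything that was not rejected above
    +.

And since the RegexURLFilter main method reads URLs from stdin, rules like these can be sanity-checked before a crawl with something along the lines of

    echo "http://222.111.44.33:8080/" | bin/nutch org.apache.nutch.net.RegexURLFilter

(the exact class name and invocation vary between Nutch versions), which should report whether each URL passes the filter.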
