Hello Andy,

> What are all the rules in your regex-urlfilter.txt file? When a URL is
> checked against the RegexURLFilter, the URL is checked against each
> regex expression iteratively. If it hits a regex rule that matches,
> it will either reject or accept the URL, depending on the + or - in
> front of the rule.
>
> So it may be possible that you have a rule which accepts your
> af.wikipedia.org URL's before it even processes the
> -^http://af.wikipedia.org/ rule.

There's nothing like that. Every rule is negative until the end of the file, which ends with a "+." to allow everything that wasn't denied.

> By the way, the RegexURLFilter class has a main method where you can
> feed in URL's via stdin for testing purposes. This has been very
> useful for me in the past.

I'll give it a try.

By the way, some more info: I deleted the old DB and recreated it before doing the next fetch, and there was no improvement.

Being unable to block Fetch from fetching certain URLs seems like a fatal shortcoming in Nutch. I want it to NEVER EVER crawl sites whose URLs are raw IP addresses of the 222.111.44.33 kind, and I also want it to NEVER EVER crawl sites that are on port 8080, for example. I have found that if the site operators don't have enough resources to get a domain name and run the site on port 80, the site is probably worthless. And yet there seems to be no way to get Nutch to not crawl these sites.

There are also sites out there that are spider traps, like non-English wikis, which are full of undesired content, yet Fetch goes through all of it.

There really needs to be a file somewhere that all the Nutch fetching and crawling utilities consult before they fetch a URL. I'm almost thinking of doing this with some kind of web proxy as a firewall, because it seems to be a big problem in the fetch commands.

If I submitted a patch to implement a URL filter for the Fetch command, would Nutch put it in the next release?

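For reference, the first-match-wins behaviour described in the quoted text can be sketched outside Nutch. This is only an illustration in Python, not the RegexURLFilter code itself: the three "-" patterns are hypothetical rules for the cases mentioned above (IP-address hosts, port 8080, af.wikipedia.org), and the sketch assumes a rule matches when its regex is found anywhere in the URL and that a URL matching no rule is rejected.

```python
import re

# Hypothetical rules in the regex-urlfilter.txt style:
# "+" accepts, "-" rejects, and the first matching rule decides.
RULES = r"""
# reject hosts that are bare dotted-quad IP addresses
-^https?://\d{1,3}(\.\d{1,3}){3}(:\d+)?(/|$)
# reject anything served on port 8080
-^https?://[^/]+:8080(/|$)
# reject the Afrikaans Wikipedia
-^http://af\.wikipedia\.org/
# accept everything that was not rejected above
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL that matches no rule is rejected


rules = parse_rules(RULES)

if __name__ == "__main__":
    for url in (
        "http://222.111.44.33/index.html",
        "http://example.com:8080/",
        "http://af.wikipedia.org/wiki/Tuisblad",
        "http://en.wikipedia.org/wiki/Main_Page",
    ):
        print("+" if accepts(rules, url) else "-", url)
```

Order matters in this model: if the "+." line were moved to the top, it would match every URL first and none of the "-" rules would ever be consulted.
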
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
