Hello Andy,

> What are all the rules in your regex-urlfilter.txt
> file?  When a URL
> is checked against the RegexURLFilter, the URL is
> checked against each
> regex expression iteratively.  If it hits a regex
> rule that matches,
> it will either reject or accept the URL, depending
> on the + or - in
> front of the rule.
> 
> So it may be possible that you have a rule which
> accepts your
> af.wikipedia.org URL's before it even processes the
> -^http://af.wikipedia.org/ rule.

There's nothing.  The rules are all negative until the
end of the file, which ends with a "+." to allow
everything that wasn't denied.
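
Roughly, the file has this shape (a trimmed sketch, not
the literal contents; the suffix line is just the usual
default rule):

  # regex-urlfilter.txt (sketch) -- first matching rule wins
  # skip image/archive suffixes (stock default rule)
  -\.(gif|GIF|jpg|JPG|png|zip|gz)$
  # skip the Afrikaans Wikipedia
  -^http://af.wikipedia.org/
  # ... more negative rules ...
  # accept everything that wasn't rejected above
  +.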

> By the way, the RegexURLFilter class has a main
> method where you can
> feed in URL's via stdin for testing purposes.  This
> has been very
> useful for me in the past.

I'll give it a try.
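
Something along these lines, I assume -- guessing at the
fully-qualified class name, and assuming bin/nutch passes
an unrecognized command through to java as a class name:

  # feed candidate URLs on stdin; if I read the main method
  # right, it prints +URL for accepted, -URL for rejected
  echo "http://af.wikipedia.org/" | \
    bin/nutch org.apache.nutch.net.RegexURLFilter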

By the way, some more info:

I deleted the old DB and recreated it before doing the
next fetch; there was no improvement.

Being unable to block Fetch from fetching certain URLs
seems like a fatal shortcoming in Nutch.  I want it to
NEVER EVER crawl sites whose URLs are bare IP addresses
like 222.111.44.33, and I also want it to NEVER EVER
crawl sites that run on port 8080, for example.  I have
found that if the site operators don't have enough
resources to get a domain name and run it on port 80,
the site is probably worthless.  And yet there seems to
be no way to get Nutch to skip these sites.
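
If the regex filter were actually honored, rules along
these lines ought to be enough to express both of those
(a sketch; I haven't double-checked the exact regex
flavour the filter expects):

  # reject URLs whose host is a bare IP address
  -^http://[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
  # reject URLs on an explicit port 8080
  -^http://[^/]+:8080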

Also, there are some sites out there that are spider
traps, like the non-English wikis, which are full of
undesired content, and yet Fetch goes right through
them all.

There really, really needs to be a file somewhere that
all the Nutch fetching and crawling utilities consult
before they fetch a URL.  I'm almost thinking of doing
this with some kind of web proxy acting as a firewall,
because it seems to be a big problem in the fetch
commands.

If I submitted a patch implementing a URL filter for
the Fetch command, would it be accepted into the next
release?

