RE: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

Chirag Chaman Fri, 22 Apr 2005 19:22:16 -0700

Yes, I think this should definitely be in the default.

 
Also, here are a few that I've used and found very useful for keeping junk
out of the DB. They may be helpful for someone doing internet crawling --
these need not go in the default file, but it would be nice if people
contributed regexes/filters and an example file were made.
 
# ignore local IPs
-^http://(127|10|192\.168|172)\..*
-^http://localhost.*


# Blocked sites - major ad servers
-atwola\.com
-servedby\.advertising\.
-ad\.doubleclick\.



-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 22, 2005 4:21 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with
the svn

Chirag Chaman wrote:
> I like this solution, simple and elegant

The credit should go to Gordon Mohr, of the Heritrix crawler.  He suggested
this to me yesterday.

> Just a modification which might make it faster for longer URLs. This 
> makes the RE non-greedy, thereby causing it to match without having to 
> examine the whole string.
> 
> -http://.*(/.+?)/.*?\1/.*?\1.*?/

Should we put something like this in the default url filter config file?

Doug
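
The repeated-path-segment rule quoted above can be tried on sample URLs with a
short sketch like the one below. It uses Python's re module purely for
illustration; the backreference and non-greedy syntax in this pattern is the
same there. The helper name has_repeated_segment and the example URLs are
assumptions made for the example.

```python
import re

# Pattern copied from the quoted rule, minus the leading "-" rule marker.
REPEATED_SEGMENT = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")

def has_repeated_segment(url):
    """True if some path piece (captured as group 1) occurs three times."""
    return REPEATED_SEGMENT.search(url) is not None

samples = [
    "http://www.example.org/a/b/a/b/a/b/c.html",   # "/a" occurs three times
    "http://www.example.org/a/b/a/b/index.html",   # only two occurrences
    "http://www.example.org/docs/api/index.html",  # no repetition
]
for url in samples:
    print(has_repeated_segment(url), url)
```

Only the first sample matches, so a filter using this rule would drop it and
keep the other two.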
