Yes, I think this should definitely be in the default.

Also, here are a few that I've used and found very useful for keeping junk out of the DB. They may be helpful for someone doing internet crawling. These need not go in the default file, but it would be nice if people contributed regexes/filters and an example file were made.

# ignore local IPs
-^http://(127|10|192\.168|172)\.*
-^http://localhost.*

# Blocked sites - major ad servers
-atwola\.com
-servedby\.advertising\.
-ad\.doubleclick\.

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, April 22, 2005 4:21 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

Chirag Chaman wrote:
> I like this solution, simple and elegant

The credit should go to Gordon Mohr, of the Heritrix crawler. He suggested this to me yesterday.

> Just a modification which might make it faster for longer URLs. This
> makes the RE non-greedy, thereby causing it to match without having to
> examine the whole string.
>
> -http://.*(/.+?)/.*?\1/.*?\1.*?/

Should we put something like this in the default url filter config file?

Doug
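
For anyone who wants to try the patterns from this thread before adding them to a filter file, here is a minimal sketch in Java. It is not Nutch code: the class `UrlFilterSketch`, the method `isExcluded`, and the `PRIVATE_HOST` pattern are hypothetical names made up for illustration, and it assumes the filter applies each "-" pattern with unanchored, find-style matching via `java.util.regex`. One thing it makes visible is that the local-IP pattern as written ends in `\.*` (zero or more dots), so it also rejects hosts such as 100.example.com and public 172.x addresses. `PRIVATE_HOST` is one possible stricter variant limited to 127.x, 10.x, 192.168.x and 172.16.x through 172.31.x.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks a URL against the exclusion
// patterns quoted in this thread, assuming find-style (unanchored) matching.
final class UrlFilterSketch {

    // The "-" patterns from the thread, with the leading "-" removed.
    static final List<Pattern> EXCLUDES = Arrays.asList(
        Pattern.compile("^http://(127|10|192\\.168|172)\\.*"),
        Pattern.compile("^http://localhost.*"),
        Pattern.compile("atwola\\.com"),
        Pattern.compile("servedby\\.advertising\\."),
        Pattern.compile("ad\\.doubleclick\\."),
        // A path segment that occurs three times, which suggests a crawler loop.
        Pattern.compile("http://.*(/.+?)/.*?\\1/.*?\\1.*?/")
    );

    // Assumed stricter alternative for private and loopback IPv4 hosts only.
    static final Pattern PRIVATE_HOST = Pattern.compile(
        "^http://(127\\.|10\\.|192\\.168\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.)");

    // Returns true if any exclusion pattern is found somewhere in the URL.
    static boolean isExcluded(String url) {
        for (Pattern p : EXCLUDES) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://127.0.0.1/index.html",
            "http://ad.doubleclick.net/banner",
            "http://example.com/a/b/a/b/a/b/",
            "http://example.com/a/b/c/",
            "http://100.example.com/"
        };
        for (String url : urls) {
            System.out.println(isExcluded(url) + "\t"
                + PRIVATE_HOST.matcher(url).find() + "\t" + url);
        }
    }
}
```

Running `main` prints, for each sample URL, whether the thread's patterns exclude it and whether the stricter private-host pattern matches it, so the two can be compared side by side.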
