RE: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

Chirag Chaman Wed, 27 Apr 2005 11:06:13 -0700

Doug,

The Heritrix version is a little restrictive.
It will catch ../junk/junk/junk but fail to catch junk/a/junk/aa/junk/aaa


The RE below will catch the second form, so a decision now needs to be made
about which form to catch and which to allow.
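
To compare the two forms side by side, here is a rough Python sketch (not
Nutch or Heritrix code). It assumes the leading "-" in the quoted RE is the
filter file's exclude marker rather than part of the pattern, that a match
anywhere in the URL counts as a catch, and that the example.com URLs with a
trailing slash are fair stand-ins for the two path forms above. The names
HERITRIX, NON_GREEDY, URLS and catches are made up for illustration.

```python
import re

# Pattern quoted in Doug's message as the one Heritrix uses.
HERITRIX = re.compile(r".*/(.*/)\1{2,}.*")

# Non-greedy pattern from the quoted message, with the leading "-" dropped
# (assumption: it marks an exclude rule and is not part of the RE itself).
NON_GREEDY = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")

# Hypothetical URLs standing in for the two path forms discussed above.
URLS = [
    "http://example.com/junk/junk/junk/",
    "http://example.com/junk/a/junk/aa/junk/aaa/",
]


def catches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


for url in URLS:
    print(url,
          "heritrix:", catches(HERITRIX, url),
          "non-greedy:", catches(NON_GREEDY, url))
```

Each output line shows which of the two REs flags that URL, which should
help when deciding which form to catch and which to allow.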

CC-



-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 27, 2005 1:34 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with
the svn

Chirag Chaman wrote:
> I like this solution, simple and elegant
> 
> Just a modification which might make it faster for longer URLs. This 
> makes the RE non-greedy, thereby causing it to match without having to 
> examine the whole string.
> 
> -http://.*(/.+?)/.*?\1/.*?\1.*?/

The Heritrix crawler uses ".*/(.*/)\1{2,}.*".

http://crawler.archive.org/cgi-bin/wiki.pl?ChaffControl

Doug
