RE: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

Chirag Chaman Wed, 27 Apr 2005 11:09:33 -0700

Doug,

The Heritrix version is a little restrictive.
It will catch ../junk/junk/junk but fail to catch junk/a/junk/aa/junk/aaa


The RE below will catch the second form -- so now a decision needs to be
made about which form to catch and which to allow.

CC-



-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 27, 2005 1:34 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with
the svn

Chirag Chaman wrote:
> I like this solution, simple and elegant
> 
> Just a modification which might make it faster for longer URLs. This 
> makes the RE non-greedy, thereby causing it to match without having to 
> examine the whole string.
> 
> -http://.*(/.+?)/.*?\1/.*?\1.*?/

The Heritrix crawler uses ".*/(.*/)\1{2,}.*".

http://crawler.archive.org/cgi-bin/wiki.pl?ChaffControl

Doug
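
To make the difference between the two expressions concrete, here is a
minimal Java sketch that runs both against one URL of each shape. The class
name, method names, and example.com URLs are hypothetical and chosen only for
illustration. It assumes the Heritrix expression must match the whole URL
(matches()) and that the filter rule is applied as a substring search
(find()), with the leading "-" of the rule dropped because it is the filter's
exclude marker rather than part of the pattern. Neither assumption is
confirmed in this thread.

```java
import java.util.regex.Pattern;

class RepeatedSegmentDemo {
    // Heritrix-style expression quoted above: one path segment repeated
    // three or more times in a row.
    static final Pattern HERITRIX = Pattern.compile(".*/(.*/)\\1{2,}.*");

    // Non-greedy expression quoted above, without the leading "-" exclude
    // marker: the same segment appearing three times, with other segments
    // allowed in between.
    static final Pattern NON_GREEDY =
        Pattern.compile("http://.*(/.+?)/.*?\\1/.*?\\1.*?/");

    // Assumption: the Heritrix expression is applied to the entire URL.
    static boolean heritrixMatches(String url) {
        return HERITRIX.matcher(url).matches();
    }

    // Assumption: the filter rule is applied as a substring search.
    static boolean nonGreedyMatches(String url) {
        return NON_GREEDY.matcher(url).find();
    }

    public static void main(String[] args) {
        String consecutive = "http://example.com/junk/junk/junk/index.html";
        String interleaved = "http://example.com/junk/a/junk/aa/junk/aaa/";

        for (String url : new String[] {consecutive, interleaved}) {
            System.out.println(url
                + " heritrix=" + heritrixMatches(url)
                + " nonGreedy=" + nonGreedyMatches(url));
        }
    }
}
```

Under these assumptions each expression flags only one of the two URL shapes,
which is the trade-off described above.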





