Doug,

The Heritrix version is a little restrictive. It will catch ../junk/junk/junk but fail to catch junk/a/junk/aa/junk/aaa.
The RE below will catch this -- so now a decision needs to be made which form to catch and which to allow.

CC-

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 27, 2005 1:34 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

Chirag Chaman wrote:
> I like this solution, simple and elegant
>
> Just a modification which might make it faster for longer URLs. This
> makes the RE non-greedy, thereby causing it to match without having to
> examine the whole string.
>
> -http://.*(/.+?)/.*?\1/.*?\1.*?/

The Heritrix crawler uses ".*/(.*/)\1{2,}.*".

http://crawler.archive.org/cgi-bin/wiki.pl?ChaffControl

Doug
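
For anyone who wants to try the two expressions from this thread side by side, here is a minimal sketch in plain Java. It is not the Nutch or Heritrix filter code: the class name, the method names, and the example.com URLs are made up for illustration, and it assumes the filter does a substring search (find()) rather than a whole-string match. It also assumes the leading '-' on the quoted expression is the filter file's exclude marker and not part of the RE itself.

```java
import java.util.regex.Pattern;

// Hypothetical helper for comparing the two expressions quoted in the thread.
public class RepeatedSegmentCheck {

    // Heritrix-style expression: the same "segment/" three or more times in a row.
    static final Pattern CONSECUTIVE =
            Pattern.compile(".*/(.*/)\\1{2,}.*");

    // Non-greedy expression from the thread, with the assumed '-' marker dropped:
    // the same "/segment" three times, with other path elements allowed in between.
    static final Pattern INTERLEAVED =
            Pattern.compile("http://.*(/.+?)/.*?\\1/.*?\\1.*?/");

    static boolean matchesConsecutive(String url) {
        return CONSECUTIVE.matcher(url).find();
    }

    static boolean matchesInterleaved(String url) {
        return INTERLEAVED.matcher(url).find();
    }

    public static void main(String[] args) {
        // Made-up sample URLs, one for each form discussed above.
        String[] urls = {
            "http://example.com/a/junk/junk/junk/x.html",
            "http://example.com/junk/a/junk/aa/junk/aaa/"
        };
        for (String url : urls) {
            System.out.println(url
                    + "  consecutive=" + matchesConsecutive(url)
                    + "  interleaved=" + matchesInterleaved(url));
        }
    }
}
```

On these two sample URLs, each expression should flag only its own form: the Heritrix-style one needs the repeats to be adjacent, while the non-greedy one consumes the slash after each repeat and so needs something between them. That is the trade-off behind the "which form to catch" question.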
