Chirag Chaman wrote:
I like this solution; it is simple and elegant.
One modification might make it faster for longer URLs: making the RE
non-greedy lets it match without having to examine the whole string.
-http://.*(/.+?)/.*?\1/.*?\1.*?/
The Heritrix crawler uses ".*/(.*/)\1{2,}.*".
http://crawler.archive.org/cgi-bin/wiki.pl?ChaffControl
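
To compare the two expressions, here is a minimal sketch in Python's re module
(not the Java regex engine that Nutch and Heritrix run on, though the syntax
used here is common to both). It assumes the leading "-" in the first pattern
is the filter file's exclude marker rather than part of the expression, that
the first pattern is applied as a search, and that the Heritrix pattern must
match the whole URL. The function names and example.com URLs are made up for
illustration.

```python
import re

# Pattern from the quoted message, minus the leading "-" (assumed to be
# the exclude marker of the URL filter file, not part of the regex).
NON_GREEDY = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")

# Pattern quoted for Heritrix: a slash-terminated chunk followed by at
# least two more consecutive copies of itself.
HERITRIX = re.compile(r".*/(.*/)\1{2,}.*")


def rejected_by_non_greedy(url: str) -> bool:
    """True if the same path element shows up three times, in any position."""
    # Assumption: the filter looks for the pattern anywhere in the URL.
    return NON_GREEDY.search(url) is not None


def rejected_by_heritrix(url: str) -> bool:
    """True if a path chunk repeats three times back to back."""
    # Assumption: the pattern has to cover the entire URL.
    return HERITRIX.fullmatch(url) is not None


if __name__ == "__main__":
    samples = [
        "http://example.com/a/b/a/b/a/b/",  # consecutive repeats of a/b/
        "http://example.com/a/x/a/y/a/",    # "a" repeats, but not consecutively
        "http://example.com/a/b/c/",        # no repeats
    ]
    for url in samples:
        print(url, rejected_by_non_greedy(url), rejected_by_heritrix(url))
```

Under these assumptions the two patterns agree on the looping URL and on the
ordinary one, but only the first rejects a URL whose repeated element is
separated by other path elements; the Heritrix pattern requires the
repetitions to be adjacent.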
Doug
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers