Chirag Chaman wrote:
I like this solution: simple and elegant.

Just one modification, which might make it faster for longer URLs: making
the RE non-greedy lets it match without having to examine the whole
string.

-http://.*(/.+?)/.*?\1/.*?\1.*?/

The Heritrix crawler uses ".*/(.*/)\1{2,}.*".

http://crawler.archive.org/cgi-bin/wiki.pl?ChaffControl
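
For comparison, here is a minimal sketch that runs both expressions over a few made-up URLs using java.util.regex. The class name, method names and sample URLs are hypothetical. It assumes the leading "-" on the first pattern is the exclude-rule prefix of the filter file rather than part of the regex, that the filter tests the pattern with find(), and that the Heritrix expression is applied as a whole-string match; none of this is confirmed by the thread.

```java
import java.util.regex.Pattern;

public class RepeatedPathDemo {
    // Expression from the quoted message, with the leading "-" dropped
    // (assumed to be a rule prefix, not part of the regex).
    static final Pattern NON_GREEDY =
        Pattern.compile("http://.*(/.+?)/.*?\\1/.*?\\1.*?/");

    // Expression quoted above for Heritrix.
    static final Pattern HERITRIX =
        Pattern.compile(".*/(.*/)\\1{2,}.*");

    // Assumption: the filter looks for the pattern anywhere in the URL.
    static boolean nonGreedyHit(String url) {
        return NON_GREEDY.matcher(url).find();
    }

    // Assumption: the whole URL must match the expression.
    static boolean heritrixHit(String url) {
        return HERITRIX.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a/b/a/b/a/b/", // same segments repeated back to back
            "http://example.com/a/x/a/y/a/z/", // one segment recurring, not adjacent
            "http://example.com/a/b/c/"        // no repetition
        };
        for (String url : urls) {
            System.out.println(url
                + "  non-greedy=" + nonGreedyHit(url)
                + "  heritrix=" + heritrixHit(url));
        }
    }
}
```

Under these assumptions the two expressions are not equivalent: the first flags a segment that recurs three times anywhere in the path, so it hits the first two sample URLs, while the Heritrix one only flags a path piece that repeats three or more times back to back, so it hits only the first.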

Doug


