Chirag Chaman wrote:
I like this solution: simple and elegant.

Here is a modification that might make it faster for longer URLs. It makes
the quantifiers in the RE non-greedy, so a match can be found without having
to examine the whole string.

-http://.*(/.+?)/.*?\1/.*?\1.*?/

The Heritrix crawler uses ".*/(.*/)\1{2,}.*".

http://crawler.archive.org/cgi-bin/wiki.pl?ChaffControl
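For anyone who wants to try the two expressions side by side, here is a rough sketch in Python. It rests on a few assumptions that are not stated above: that the leading "-" in the first expression is the filter file's exclude marker rather than part of the RE, that the first expression is applied as a search anywhere in the URL, and that the Heritrix expression must match the whole URL. The function names and sample URLs are made up for illustration.

```python
import re

# First expression from the message, without the leading "-"
# (assumed to be the exclude marker of the URL filter file).
NON_GREEDY = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")

# Expression quoted for Heritrix.
HERITRIX = re.compile(r".*/(.*/)\1{2,}.*")


def flagged_by_non_greedy(url: str) -> bool:
    """True if the first expression is found anywhere in the URL (assumed semantics)."""
    return NON_GREEDY.search(url) is not None


def flagged_by_heritrix(url: str) -> bool:
    """True if the Heritrix expression matches the entire URL (assumed semantics)."""
    return HERITRIX.fullmatch(url) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs.
    samples = [
        "http://example.com/foo/bar/foo/bar/foo/bar/",
        "http://example.com/foo/foo/foo/",
        "http://example.com/docs/index.html",
    ]
    for url in samples:
        print(url, flagged_by_non_greedy(url), flagged_by_heritrix(url))
```

The two expressions are not equivalent. The first consumes the "/" that follows each repeated segment, so it looks for the segment three times with at least that slash between occurrences. The Heritrix one requires a path chunk ending in "/" to appear at least three times back to back.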

Doug


