I like this solution; it is simple and elegant.
Here is a small modification that might make it faster for longer URLs. It uses non-greedy quantifiers, so the RE can match as soon as it finds the repeated segment, rather than consuming the whole string first and then backtracking.
-http://.*(/.+?)/.*?\1/.*?\1.*?/
The Heritrix crawler uses ".*/(.*/)\1{2,}.*".
http://crawler.archive.org/cgi-bin/wiki.pl?ChaffControl
Doug
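
For anyone who wants to try the two patterns above side by side, here is a rough Python sketch. Some assumptions on my part: the leading "-" on the first pattern is treated as a filter-file exclude marker and dropped, the first pattern is applied as a substring search, and the Heritrix one is applied as a whole-string match. The function names and example.com URLs are made up for illustration.

```python
import re

# Non-greedy pattern from the post, minus the leading "-"
# (assumed to be an exclude marker, not part of the regex).
LAZY_PATTERN = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")

# Pattern quoted for the Heritrix crawler.
HERITRIX_PATTERN = re.compile(r".*/(.*/)\1{2,}.*")


def lazy_flags(url):
    """True if the non-greedy pattern is found anywhere in the URL."""
    return LAZY_PATTERN.search(url) is not None


def heritrix_flags(url):
    """True if the Heritrix pattern matches the entire URL."""
    return HERITRIX_PATTERN.fullmatch(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/a/b/a/b/a/b/",    # path pair repeated three times
        "http://example.com/a/b/b/b/c.html",  # one segment repeated back to back
        "http://example.com/a/b/c.html",      # ordinary URL
    ):
        print(url, lazy_flags(url), heritrix_flags(url))
```

Both flag the first URL and neither flags the last. On the second URL only the Heritrix pattern matches, since the first pattern needs a "/" plus something in between the repeats, so the two are not drop-in replacements for each other.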
