I like this solution. It is simple and elegant.
The credit should go to Gordon Mohr, of the Heritrix crawler. He suggested this to me yesterday.
Here is a small modification that might make it faster for longer URLs. It makes the quantifiers in the RE non-greedy, so the RE can match without having to examine the whole string.
-http://.*(/.+?)/.*?\1/.*?\1.*?/
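
A quick way to sanity-check the pattern is a small Python sketch like the one below. It uses Python's `re` module rather than the filter's own regex engine, so behavior may differ in corner cases. The leading `-` is assumed to be the filter file's exclude marker and is dropped here; the name `looks_like_trap` and the example URLs are made up for illustration.

```python
import re

# The pattern from above, minus the leading "-" (assumed to be the
# filter file's "exclude" marker, not part of the regex itself).
TRAP_RE = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")


def looks_like_trap(url):
    """Return True if the pattern finds a path piece repeated three times."""
    return TRAP_RE.search(url) is not None


# Hypothetical URLs, for illustration only.
print(looks_like_trap("http://example.com/foo/bar/foo/bar/foo/bar/"))
print(looks_like_trap("http://example.com/docs/guide/index.html"))
```

The first URL repeats `/foo` three times and is flagged; the second has no repeated piece and passes through.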
Should we put something like this in the default URL filter config file?
Doug
