Chirag Chaman wrote:
I like this solution: simple and elegant.

The credit should go to Gordon Mohr, of the Heritrix crawler. He suggested this to me yesterday.


Here is a modification that might make it faster for longer URLs. It makes
the quantifiers in the RE non-greedy, so the RE can match without having to
examine the whole string.

-http://.*(/.+?)/.*?\1/.*?\1.*?/
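
For anyone who wants to try the pattern outside of the crawler, here is a minimal sketch in Python. It assumes the leading "-" is the filter file's exclusion marker and not part of the expression, and it uses Python's re module, which is not the regex engine Nutch runs on, so treat it as an illustration of what the pattern matches and not as a timing test. The function name and the example URLs are made up for this sketch.

```python
import re

# Pattern from the message, without the leading "-" (assumed here to be
# the URL filter's "exclude" marker rather than part of the expression).
REPEATED_SEGMENT = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")


def has_repeated_segment(url: str) -> bool:
    """Return True if the pattern is found in the URL.

    Hypothetical helper for illustration; not part of Nutch.
    """
    return REPEATED_SEGMENT.search(url) is not None


# Hypothetical URLs, invented for this example.
looping = "http://example.com/a/b/a/b/a/b/"           # "/a" occurs three times
twice = "http://example.com/a/b/a/c/"                 # "/a" occurs only twice
normal = "http://example.com/docs/guide/index.html"   # no repeated segment

for url in (looping, twice, normal):
    print(url, has_repeated_segment(url))
```

Only the first URL should be reported as matching, since the backreference \1 has to find the captured segment two more times after its first occurrence.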

Should we put something like this in the default URL filter config file?

Doug


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
