Doug, I like this solution: simple and elegant.
Just a modification which might make it faster for longer URLs. This makes the RE non-greedy, so that it can match without having to examine the whole string:

-http://.*(/.+?)/.*?\1/.*?\1.*?/

Thus, for the string below, it should stop at

http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/

as it has seen /kepaloldal three times.

CC-

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Doug Cutting
Sent: Friday, April 22, 2005 3:02 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
>> I now understand the solution of the 'deply same pages' solution
>> reported to JIRA
>> (like: http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/1121kisputekep.htm).
>
> Another thing to try would be to write a tool that iterates through
> the pagedb by md5 and deletes pages that are duplicates. That would
> be scalable.

I thought about this a bit more and I don't think it would work. We would need to know which URL caused each page to be added, and that information is lost in the current webdb.

The example above and lots of other things like it could easily be rejected with a regular expression that matches URLs with any slash-delimited component repeated three or more times. For example:

-http://.*(/.+)/.*\1/.*\1.*/

Doug

-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
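
For anyone who wants to try the two expressions from this thread, here is a minimal sketch in Python. It assumes the leading "-" is the filter file's exclude marker rather than part of the regex, and that the filter asks whether the pattern occurs anywhere in the URL. The helper name is_rejected and the "normal" URL are made up for illustration; the constructs used (greedy and non-greedy quantifiers, \1 backreferences) behave the same way in Python's re module as in Java-style regexes.

```python
import re

# Patterns copied from the thread, without the leading "-"
# (assumed to be the filter file's "exclude" marker, not regex syntax).
GREEDY = re.compile(r"http://.*(/.+)/.*\1/.*\1.*/")
NON_GREEDY = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")


def is_rejected(url, pattern):
    """Hypothetical helper: True if the pattern occurs anywhere in the URL."""
    return pattern.search(url) is not None


# Shortened looping URL quoted in the thread: /kepaloldal appears three times.
looping = ("http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal"
           "/m/2001/kepaloldal/m/2002/kepaloldal/")
# Made-up URL with no repeated component, for comparison.
normal = "http://www.nb1.hu/galeria/Hun_Ita/index.html"

for name, pattern in (("greedy", GREEDY), ("non-greedy", NON_GREEDY)):
    print(name, is_rejected(looping, pattern), is_rejected(normal, pattern))
```

Greedy and non-greedy quantifiers change the order in which the engine tries alternatives, not whether a match exists, so both patterns should reject the same URLs. Whether the non-greedy form is actually faster on long URLs would need to be measured.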
