[EMAIL PROTECTED] wrote:
> I now understand the solution to the 'deeply nested same pages' problem reported in JIRA (like: http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/1121kisputekep.htm).
>
> Another thing to try would be to write a tool that iterates through the pagedb by md5 and deletes pages that are duplicates. That would be scalable.
I thought about this a bit more, and I don't think it would work. We would need to know which URL caused each page to be added, and that information is lost in the current webdb.
The example above, and lots of other things like it, could easily be rejected with a regular expression that matches URLs containing any slash-delimited component repeated three or more times. For example:
-http://.*(/.+)/.*\1/.*\1.*/
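As a sketch of how that pattern behaves, here is the same regex expressed in Java's regex syntax (the class name and sample URLs are hypothetical, chosen only for illustration): the backreference `\1` forces the captured path component to occur at least three times, which is what rejects recursively repeating URLs like the one above while leaving ordinary URLs alone.

```java
import java.util.regex.Pattern;

public class RepeatedPathFilter {
    // Doug's suggested pattern: reject any URL in which some
    // slash-delimited path component (captured as group 1)
    // reappears at least two more times.
    private static final Pattern REPEATED =
        Pattern.compile("http://.*(/.+)/.*\\1/.*\\1.*/");

    public static boolean looksRecursive(String url) {
        return REPEATED.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical URLs for illustration only.
        String bad  = "http://example.com/gallery/m/2001/gallery/m/2002/gallery/m/2003/page.htm";
        String good = "http://example.com/a/b/c/page.html";
        System.out.println(looksRecursive(bad));   // true
        System.out.println(looksRecursive(good));  // false
    }
}
```

In Nutch's regex-urlfilter syntax the leading `-` marks this as a deny rule, so matching URLs are dropped before they ever reach the fetcher.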
Doug
-------------------------------------------------------
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
