Doug,

I like this solution; it is simple and elegant.

Just one modification, which might make it faster for longer URLs: making
the quantifiers in the RE non-greedy, so that it can match without having to
examine the whole string.

-http://.*(/.+?)/.*?\1/.*?\1.*?/

Thus, for the URL quoted below, it should stop matching at

http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/

because by that point it has seen /kepaloldal three times.
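
To check the behaviour described above, here is a minimal Python sketch. It assumes the leading "-" is the URL filter's exclude marker rather than part of the expression, and it uses Python's re module, which may differ in detail from the regex engine Nutch itself uses. The name NON_GREEDY is made up for illustration.

```python
import re

# Non-greedy variant from this message, without the leading "-"
# (assumed to be the filter's exclude marker, not part of the pattern).
NON_GREEDY = re.compile(r"http://.*(/.+?)/.*?\1/.*?\1.*?/")

# The shortened URL from this message, joined back into one string.
url = ("http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal"
       "/m/2001/kepaloldal/m/2002/kepaloldal/")

m = NON_GREEDY.match(url)
if m:
    # group(1) is the slash-delimited component that was seen three times.
    print("repeated component:", m.group(1))
    print("match ends at offset", m.end(), "of", len(url))
else:
    print("no match")
```

On this shortened URL the match should end right after the third /kepaloldal/, which is also the end of the string.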

CC-


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Doug
Cutting
Sent: Friday, April 22, 2005 3:02 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with
the svn

Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
>> I now understand the solution of the 'deeply same pages' solution
>> reported to JIRA
>>
(like:http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/
m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/20
01/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2002/k
epaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepal
oldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepalolda
l/m/2002/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/11
21kisputekep.htm).
> 
> Another thing to try would be to write a tool that iterates through 
> the pagedb by md5 and deletes pages that are duplicates.  That would 
> be scalable.

I thought about this a bit more and I don't think it would work.  We would
need to know which URL caused each page to be added, and that information is
lost in the current webdb.

The example above and lots of other things like it could easily be rejected
with a regular expression that matches URLs with any slash-delimited
component repeated three or more times.  For example:

-http://.*(/.+)/.*\1/.*\1.*/
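
As a rough illustration of how such a rule would behave, the sketch below applies the expression in Python. It is only a sketch under assumptions: the leading "-" is taken to be the filter's exclude marker and is dropped, Python's re module stands in for whatever regex engine the filter actually uses, and the helper accept() and the example.com URL are hypothetical.

```python
import re

# The expression above without the leading "-" (assumed to be the
# exclude marker of the URL filter rule, not part of the pattern).
REPEATED_COMPONENT = re.compile(r"http://.*(/.+)/.*\1/.*\1.*/")

def accept(url):
    """Hypothetical helper: False for URLs the exclude rule would reject."""
    return REPEATED_COMPONENT.search(url) is None

# A shortened form of the reported URL: /kepaloldal appears three times.
looping = ("http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal"
           "/m/2001/kepaloldal/m/2002/kepaloldal/1121kisputekep.htm")
# Made-up URL with no repeated component, for contrast.
plain = "http://www.example.com/galeria/index.html"

for u in (looping, plain):
    print("accept" if accept(u) else "reject", u)
```

The looping URL is rejected because /kepaloldal occurs three times; the plain one passes.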

Doug


-------------------------------------------------------
SF email is sponsored by - The IT Product Guide Read honest & candid reviews
on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers



