Doug,

Would it be possible to have a tool like an "alert system" that monitors for a massive number of URLs coming from the same site?
I'm thinking of a tool that runs at updatedb time, so we could delete the suspect sites (with a prunedb tool) and add the corresponding regex to regex-urlfilter. We know that, for example, ebay and geocities have many URLs, but we don't know the potential web spammers in advance; we can only recognize them by the huge number of URLs they try to insert into the webdb.
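Something like this very rough sketch is what I mean; it assumes a plain text dump with one URL per line rather than the real webdb API, and the class name, file format and threshold are only for illustration:

// HostUrlAlert.java -- hypothetical sketch, not the actual Nutch webdb API.
// Reads one URL per line from args[0] and reports hosts with more than args[1] URLs.
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class HostUrlAlert {
  public static void main(String[] args) throws Exception {
    int threshold = Integer.parseInt(args[1]);           // e.g. 100000
    Map counts = new HashMap();                          // host -> URL count
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String host;
      try {
        host = new URL(line.trim()).getHost();
      } catch (Exception ex) {
        continue;                                        // skip malformed lines
      }
      Integer c = (Integer) counts.get(host);
      counts.put(host, new Integer(c == null ? 1 : c.intValue() + 1));
    }
    in.close();
    for (Iterator i = counts.entrySet().iterator(); i.hasNext();) {
      Map.Entry e = (Map.Entry) i.next();
      if (((Integer) e.getValue()).intValue() > threshold) {
        System.out.println(e.getValue() + "\t" + e.getKey());   // suspiciously large host
      }
    }
  }
}

A report like that would let us see which hosts are legitimately big (ebay, geocities) and which look like spammers worth a reject rule in regex-urlfilter.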


Sorry for my bad English.

Thanks,

Massimo


On 22 Apr 2005, at 21:31, Chirag Chaman wrote:

Doug,

I like this solution: simple and elegant.

Just a modification which might make it faster for longer URLs: this makes
the RE non-greedy, so it can match without having to examine the whole
string.


-http://.*(/.+?)/.*?\1/.*?\1.*?/

Thus, for the long string quoted below, it should stop as early as
http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/


since by that point it has seen /kepaloldal three times.
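As a quick sanity check with plain java.util.regex (the leading '-' of the urlfilter syntax stripped, and the URLs shortened; this only shows the patterns behave as expected, not how Nutch applies them):

// RepeatFilterTest.java -- sanity check for the repeated-path-component patterns.
import java.util.regex.Pattern;

public class RepeatFilterTest {
  public static void main(String[] args) {
    String greedy    = "http://.*(/.+)/.*\\1/.*\\1.*/";      // Doug's original pattern
    String nonGreedy = "http://.*(/.+?)/.*?\\1/.*?\\1.*?/";  // non-greedy variant
    String spam   = "http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/"
                  + "m/2001/kepaloldal/m/2002/kepaloldal/";
    String normal = "http://www.nb1.hu/galeria/Hun_Ita/reti/index.htm";

    System.out.println(Pattern.compile(greedy).matcher(spam).find());      // true
    System.out.println(Pattern.compile(nonGreedy).matcher(spam).find());   // true
    System.out.println(Pattern.compile(nonGreedy).matcher(normal).find()); // false
  }
}

Both patterns match the repeated-component URL and neither matches the normal one; the non-greedy form should simply tend to settle on a match with less scanning on very long strings.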

CC-


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Doug Cutting
Sent: Friday, April 22, 2005 3:02 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn


Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
I now understand the solution to the 'duplicate same pages' problem
reported in JIRA

(like: http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/1121kisputekep.htm).

Another thing to try would be to write a tool that iterates through the pagedb by md5 and deletes pages that are duplicates. That would be scalable.

I thought about this a bit more and I don't think it would work. We would
need to know which URL caused each page to be added, and that information is
lost in the current webdb.
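(For reference, such a pass would look roughly like the sketch below, where Page is just a stand-in record type and not the actual webdb classes; even then we would still need the lost URL-to-page provenance to know which links to stop following.)

// DedupByMd5.java -- hypothetical sketch, not the real webdb API.
import java.util.Iterator;
import java.util.List;

public class DedupByMd5 {

  // Stand-in for a webdb page record: just a content hash and the URL it came from.
  static class Page {
    String md5, url;
    Page(String md5, String url) { this.md5 = md5; this.url = url; }
  }

  // Assumes pages is already sorted by MD5; keeps the first page per digest, drops the rest.
  static void prune(List pages) {
    String last = null;
    for (Iterator i = pages.iterator(); i.hasNext();) {
      Page p = (Page) i.next();
      if (p.md5.equals(last)) {
        i.remove();               // same content already kept: delete the duplicate
      } else {
        last = p.md5;
      }
    }
  }
}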


The example above and lots of other things like it could easily be rejected
with a regular expression that matches URLs with any slash-delimited
component repeated three or more times. For example:


-http://.*(/.+)/.*\1/.*\1.*/
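
In conf/regex-urlfilter.txt that would just be one more reject rule ahead of the usual catch-all accept, something like (a sketch of the +/- pattern-file format):

# reject URLs where any slash-delimited path component repeats three or more times
-http://.*(/.+)/.*\1/.*\1.*/

# accept anything else (the usual catch-all at the end of the file)
+.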

Doug

