Doug,

Would it be possible to have a tool like an "alert system" that monitors for a massive number of URLs coming from the same site?
I'm thinking of a tool that runs at updatedb time, so we could delete the suspect sites (with a prunedb tool) and add the corresponding regex to regex-urlfilter. We know that, for example, ebay and geocities have many URLs, but we don't know the potential web spammers in advance; we can only recognize them by the huge number of URLs they try to insert into the webdb.
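Something like this very rough sketch is what I mean; it assumes a plain text dump with one URL per line rather than the real webdb API, and the class name, file format and threshold are only for illustration:

// HostUrlAlert.java -- hypothetical sketch, not the actual Nutch webdb API.
// Reads one URL per line from args[0] and reports hosts with more than args[1] URLs.
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class HostUrlAlert {
  public static void main(String[] args) throws Exception {
    int threshold = Integer.parseInt(args[1]);           // e.g. 100000
    Map counts = new HashMap();                          // host -> URL count
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String host;
      try {
        host = new URL(line.trim()).getHost();
      } catch (Exception ex) {
        continue;                                        // skip malformed lines
      }
      Integer c = (Integer) counts.get(host);
      counts.put(host, new Integer(c == null ? 1 : c.intValue() + 1));
    }
    in.close();
    for (Iterator i = counts.entrySet().iterator(); i.hasNext();) {
      Map.Entry e = (Map.Entry) i.next();
      if (((Integer) e.getValue()).intValue() > threshold) {
        System.out.println(e.getValue() + "\t" + e.getKey());   // suspiciously large host
      }
    }
  }
}

A report like that would let us see which hosts are legitimately big (ebay, geocities) and which look like spammers worth a reject rule in regex-urlfilter.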


Sorry for my bad English.

Thanks,

Massimo


On 22 Apr 2005, at 21:31, Chirag Chaman wrote:

Doug,

I like this solution: simple and elegant.

Just a modification which might make it faster for longer URLs: this makes
the RE non-greedy, so it can match without having to examine the whole
string.


-http://.*(/.+?)/.*?\1/.*?\1.*?/

Thus, for the long string quoted below, it should stop as early as
http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/


since by that point it has seen /kepaloldal three times.
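As a quick sanity check with plain java.util.regex (the leading '-' of the urlfilter syntax stripped, and the URLs shortened; this only shows the patterns behave as expected, not how Nutch applies them):

// RepeatFilterTest.java -- sanity check for the repeated-path-component patterns.
import java.util.regex.Pattern;

public class RepeatFilterTest {
  public static void main(String[] args) {
    String greedy    = "http://.*(/.+)/.*\\1/.*\\1.*/";      // Doug's original pattern
    String nonGreedy = "http://.*(/.+?)/.*?\\1/.*?\\1.*?/";  // non-greedy variant
    String spam   = "http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/"
                  + "m/2001/kepaloldal/m/2002/kepaloldal/";
    String normal = "http://www.nb1.hu/galeria/Hun_Ita/reti/index.htm";

    System.out.println(Pattern.compile(greedy).matcher(spam).find());      // true
    System.out.println(Pattern.compile(nonGreedy).matcher(spam).find());   // true
    System.out.println(Pattern.compile(nonGreedy).matcher(normal).find()); // false
  }
}

Both patterns match the repeated-component URL and neither matches the normal one; the non-greedy form should simply tend to settle on a match with less scanning on very long strings.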

CC-


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Doug Cutting
Sent: Friday, April 22, 2005 3:02 PM
To: [email protected]
Subject: Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn


Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
I now understand the solution to the 'duplicate same pages' problem
reported in JIRA

(like: http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/1121kisputekep.htm).

Another thing to try would be to write a tool that iterates through the pagedb by md5 and deletes pages that are duplicates. That would be scalable.

I thought about this a bit more and I don't think it would work. We would
need to know which URL caused each page to be added, and that information is
lost in the current webdb.
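(For reference, such a pass would look roughly like the sketch below, where Page is just a stand-in record type and not the actual webdb classes; even then we would still need the lost URL-to-page provenance to know which links to stop following.)

// DedupByMd5.java -- hypothetical sketch, not the real webdb API.
import java.util.Iterator;
import java.util.List;

public class DedupByMd5 {

  // Stand-in for a webdb page record: just a content hash and the URL it came from.
  static class Page {
    String md5, url;
    Page(String md5, String url) { this.md5 = md5; this.url = url; }
  }

  // Assumes pages is already sorted by MD5; keeps the first page per digest, drops the rest.
  static void prune(List pages) {
    String last = null;
    for (Iterator i = pages.iterator(); i.hasNext();) {
      Page p = (Page) i.next();
      if (p.md5.equals(last)) {
        i.remove();               // same content already kept: delete the duplicate
      } else {
        last = p.md5;
      }
    }
  }
}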


The example above and lots of other things like it could easily be rejected
with a regular expression that matches URLs with any slash-delimited
component repeated three or more times. For example:


-http://.*(/.+)/.*\1/.*\1.*/
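
In conf/regex-urlfilter.txt that would just be one more reject rule ahead of the usual catch-all accept, something like (a sketch of the +/- pattern-file format):

# reject URLs where any slash-delimited path component repeats three or more times
-http://.*(/.+)/.*\1/.*\1.*/

# accept anything else (the usual catch-all at the end of the file)
+.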

Doug

