Hi folks,

Offhand, I'm not aware of any slam-dunk solution to link farms either. One thing that could help mitigate the problem is a pre-built blacklist of some sort. For example:

http://www.squidguard.org/blacklist/

That one is really meant for blocking user access to porn, known warez providers, etc., but it may have some value for you.
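If you go that route, the check itself is trivial. Here's a rough Python sketch (the path "blacklists/porn/domains" is just a guess at the layout -- I recall the squidGuard lists shipping as per-category directories with a plain "domains" file, one domain per line, but verify against the tarball you actually download):

    # load a squidGuard-style "domains" file and test host names against it
    def load_domains(path):
        with open(path) as f:
            return set(line.strip().lower() for line in f
                       if line.strip() and not line.startswith('#'))

    def is_blacklisted(host, domains):
        # a host is blacklisted if it, or any parent domain, is in the list
        parts = host.lower().split('.')
        return any('.'.join(parts[i:]) in domains for i in range(len(parts)))

    # e.g.:
    # bad = load_domains('blacklists/porn/domains')
    # is_blacklisted('some.host.example.com', bad)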

Another source of link farms is parked-domain providers. Many of these can be identified by their DNS server names; there's a quick sketch of that check after the list. Some of the top offenders (afaik) include:
- dns(\d+).name-services.com
- ns(\d+).directnic.com
- ns(\d+).itsyourdomain.com
- park(\d+).secureserver.net
- ns.buydomains.com
- this-domain-for-sale.com
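Something like the following is all the check really needs to be (a quick Python sketch; the example host names are made up, and the nameserver list could come from wherever you already get NS records):

    import re

    # nameserver patterns for known parked-domain providers (the list above)
    PARKED_NS = [re.compile(p + r'$', re.I) for p in (
        r'dns\d+\.name-services\.com',
        r'ns\d+\.directnic\.com',
        r'ns\d+\.itsyourdomain\.com',
        r'park\d+\.secureserver\.net',
        r'ns\.buydomains\.com',
        r'this-domain-for-sale\.com',
    )]

    def looks_parked(nameservers):
        # nameservers: list of NS host names for a domain, however you obtained them
        return any(pat.search(ns.lower()) for ns in nameservers for pat in PARKED_NS)

    # e.g. looks_parked(['park1.secureserver.net', 'park2.secureserver.net']) -> True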

A reasonable first pass at this list can be made by getting the VeriSign COM zone file, counting domains per DNS server, and checking the top 100 or so. (That's what I did, anyway! :)
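The counting step is basically just this (a sketch -- I'm assuming the zone file's delegation lines look roughly like "EXAMPLE NS NS1.SOMEDNS.COM.", and "com.zone" is just a placeholder name; adjust the parsing to whatever the file actually contains):

    from collections import Counter

    ns_counts = Counter()
    with open('com.zone') as zone:
        for line in zone:
            fields = line.split()
            # assumed layout: <domain> NS <nameserver>; skip comments and other record types
            if len(fields) >= 3 and fields[1].upper() == 'NS':
                ns_counts[fields[2].lower().rstrip('.')] += 1

    # dump the 100 busiest nameservers for manual review
    for ns, count in ns_counts.most_common(100):
        print(count, ns)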

Rob, does that help you? Or are you hitting a different type of link farm?

--Matt

On Mar 7, 2006, at 5:13 PM, Stefan Groschupf wrote:

Hi,

Is the content of the pages 'mostly' identical?
Since we can now provide custom hash implementations to the crawlDB, what do people think about locality-sensitive hashing?

http://citeseer.ist.psu.edu/haveliwala00scalable.html

As far as I understand the paper, we could implement the hashing in a way that lets us treat 'similar' pages (just one word changed) as one. My experience with link farms is that the pages are identical except for one number or word or date or something like that. In such a case, LSH could be an interesting way to get the problem solved.
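Just to make the idea concrete, here is a very rough sketch of one such fingerprint (a simhash-style bitwise majority over word hashes -- not necessarily the exact scheme from the paper, and not wired into the crawlDB, just an illustration of why a one-word change leaves the hash mostly intact):

    import hashlib

    def fingerprint(text, bits=64):
        # simhash-style fingerprint: near-identical texts get near-identical hashes
        vector = [0] * bits
        for word in text.lower().split():
            h = int(hashlib.md5(word.encode()).hexdigest(), 16)
            for i in range(bits):
                vector[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if vector[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count('1')

    # two link-farm pages that differ by a single word end up only a few bits apart,
    # so they can be bucketed together instead of being treated as distinct pages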

Any thoughts?

Stefan


On Mar 7, 2006, at 10:38 PM, Ken Krugler wrote:

We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.

I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.

Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?

I've read a paper on detecting link farms, but from what I remember, it wasn't a slam-dunk to implement.

So far we've relied on manually detecting these, and then pruning the results from the crawldb and adding them to the regex-urlfilter file.
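Something along these lines would probably get us most of the way to automating that (a rough Python sketch -- it assumes a plain-text dump of crawldb host names, one per line, in a hypothetical "hosts.txt", and it naively treats the last two labels as the registered domain, which breaks on things like co.uk):

    from collections import defaultdict

    subdomains = defaultdict(set)
    with open('hosts.txt') as f:                   # one host name per line
        for line in f:
            host = line.strip().lower()
            if not host:
                continue
            domain = '.'.join(host.split('.')[-2:])   # naive "registered domain"
            subdomains[domain].add(host)

    for domain, hosts in sorted(subdomains.items()):
        if len(hosts) > 100:                       # the "more than 100 sub-domains" threshold
            # print an exclusion rule in regex-urlfilter syntax ("-" = reject)
            print('-^https?://([a-z0-9-]+\\.)*%s/' % domain.replace('.', '\\.'))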

-- Ken



--
Matt Kangas / [EMAIL PROTECTED]



