Hi folks,
Offhand, I'm not aware of any slam-dunk solution to link farms
either. One thing that could help mitigate the problem is a pre-built
blacklist of some sort. For example:
http://www.squidguard.org/blacklist/
That one is really meant for blocking user access to porn, known
warez providers, etc., but it may have some value for you.
Another source of link farms is parked-domain providers. Many of
these can be identified by their DNS server name. Some of the top
offenders (afaik) include the following (a quick check against these
patterns is sketched after the list):
- dns(\d+).name-services.com
- ns(\d+).directnic.com
- ns(\d+).itsyourdomain.com
- park(\d+).secureserver.net
- ns.buydomains.com
- this-domain-for-sale.com
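For what it's worth, a check against those nameserver patterns could look
roughly like the sketch below. This is just a quick illustration (the class
name and structure are mine, not anything that ships with Nutch); you still
have to look up each domain's authoritative nameserver yourself, e.g. with
dig or a DNS library.

  import java.util.regex.Pattern;

  public class ParkedNsCheck {
      // Parked-domain nameserver patterns from the list above.
      private static final Pattern[] PARKED_NS = {
          Pattern.compile("dns\\d+\\.name-services\\.com"),
          Pattern.compile("ns\\d+\\.directnic\\.com"),
          Pattern.compile("ns\\d+\\.itsyourdomain\\.com"),
          Pattern.compile("park\\d+\\.secureserver\\.net"),
          Pattern.compile("ns\\.buydomains\\.com"),
          Pattern.compile("this-domain-for-sale\\.com")
      };

      // True if the nameserver host matches a known parked-domain provider.
      public static boolean isParkedNameserver(String nsHost) {
          String host = nsHost.toLowerCase();
          if (host.endsWith(".")) {           // dig output often has a trailing dot
              host = host.substring(0, host.length() - 1);
          }
          for (Pattern p : PARKED_NS) {
              if (p.matcher(host).matches()) {
                  return true;
              }
          }
          return false;
      }
  }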
A reasonable first pass at this list can be achieved by downloading the
Verisign COM zone file, counting the number of domains per DNS server,
and then checking the top 100 or so. (That's what I did, anyway! :)
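In case it's useful, the counting step can be done with something like the
sketch below. I'm assuming NS records appear one per line with an "NS" field
followed by the nameserver; the real zone file has a few more fields (TTL,
class), so you may need to adjust the parsing. Sort the output by count and
take the head to get your top-100 list.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.HashMap;
  import java.util.Map;

  public class NsCounter {
      public static void main(String[] args) throws Exception {
          Map<String, Integer> counts = new HashMap<String, Integer>();
          BufferedReader in = new BufferedReader(new FileReader(args[0]));
          String line;
          while ((line = in.readLine()) != null) {
              line = line.trim();
              // skip blank lines, comments, and directives
              if (line.length() == 0 || line.startsWith(";") || line.startsWith("$")) {
                  continue;
              }
              String[] fields = line.split("\\s+");
              // find the NS record type and take the next field as the nameserver
              for (int i = 1; i < fields.length - 1; i++) {
                  if (fields[i].equalsIgnoreCase("NS")) {
                      String ns = fields[i + 1].toLowerCase();
                      Integer c = counts.get(ns);
                      counts.put(ns, c == null ? 1 : c + 1);
                      break;
                  }
              }
          }
          in.close();
          // one "nameserver <tab> domain-count" line per nameserver
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
              System.out.println(e.getKey() + "\t" + e.getValue());
          }
      }
  }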
Rob, does that help you? Or are you hitting a different type of link
farm?
--Matt
On Mar 7, 2006, at 5:13 PM, Stefan Groschupf wrote:
Hi,
Is the content of the pages 'mostly' identical?
Since we can now provide custom hash implementations to the
crawlDB, what do people think about locality-sensitive hashing?
http://citeseer.ist.psu.edu/haveliwala00scalable.html
As far as I understand the paper, we could implement the hashing in
such a way that 'similar' pages (differing by just one word, say) are
treated as one.
My experience with link farms is that the pages are identical except
for one number, word, date, or something like that.
In such a case LSH could be an interesting way to attack the problem.
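Just to sketch the direction (this is my rough reading of the min-hashing
idea, not code from the paper; the shingle size, number of hash functions,
and the cheap hash mixing are arbitrary choices):

  import java.util.Arrays;
  import java.util.Random;

  public class MinHashSketch {
      private final long[] seeds;

      public MinHashSketch(int numHashes, long seed) {
          Random rnd = new Random(seed);
          seeds = new long[numHashes];
          for (int i = 0; i < numHashes; i++) {
              seeds[i] = rnd.nextLong();
          }
      }

      // Min-hash signature over word 4-shingles of the page text.
      // Pages that differ in only a word or number agree on most positions.
      public long[] signature(String text) {
          String[] words = text.toLowerCase().split("\\W+");
          long[] sig = new long[seeds.length];
          Arrays.fill(sig, Long.MAX_VALUE);
          for (int i = 0; i + 4 <= words.length; i++) {
              String shingle = words[i] + " " + words[i + 1] + " "
                             + words[i + 2] + " " + words[i + 3];
              for (int h = 0; h < seeds.length; h++) {
                  // cheap per-seed mixing of the shingle hash; a real
                  // implementation would use a proper hash family
                  long v = shingle.hashCode() * 0x9E3779B97F4A7C15L + seeds[h];
                  v ^= (v >>> 31);
                  if (v < sig[h]) {
                      sig[h] = v;
                  }
              }
          }
          return sig;
      }

      // Fraction of positions on which two signatures agree;
      // this estimates the Jaccard similarity of the shingle sets.
      public static double similarity(long[] a, long[] b) {
          int same = 0;
          for (int i = 0; i < a.length; i++) {
              if (a[i] == b[i]) same++;
          }
          return (double) same / a.length;
      }
  }

Two near-identical farm pages should then come out with a similarity close
to 1.0, so they can be bucketed or collapsed without a full page-by-page
comparison.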
Any thoughts?
Stefan
Am 07.03.2006 um 22:38 schrieb Ken Krugler:
We've managed to dig ourselves into a couple of link farms with tens
of thousands of sub-domains.
I didn't notice until they blocked our DNS requests and the Nutch
error rates shot way up.
Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?
I've read a paper on detecting link farms, but from what I
remember, it wasn't a slam-dunk to implement.
So far we've relied on manually detecting these, and then pruning
the results from the crawldb and adding them to the regex-urlfilter file.
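(An entry of that sort would look roughly like the following; the domain
here is made up:)

  # skip a detected link farm and all of its sub-domains
  -^http://([a-zA-Z0-9.-]+\.)?link-farm-example\.com/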
-- Ken
--
Matt Kangas / [EMAIL PROTECTED]