I don't think it's a slam dunk either; even Google doesn't do a great job of detecting these. I suspect a lot of it is still done manually.

I think you'd have to look at detecting closed or mostly closed networks, since a link farm tends to be tightly clustered from a link-graph perspective. As noted, that's not easy to implement, which is why people working in SEO still use the technique to game the search engines.
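
To make the idea concrete, here is a rough sketch (nothing Nutch ships with; the class name and input format are made up) that just counts distinct sub-domains per registered domain from a plain-text list of host names, e.g. extracted from a crawldb dump, and prints the domains over a threshold. It's far cruder than real link-graph clustering, but it would have flagged the "tens of thousands of sub-domains" case below:

// Rough sketch, not part of Nutch: flag domains with suspiciously many
// sub-domains.  Input is assumed to be a file with one host name per line.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

public class SubdomainCounter {

    // Naive "registered domain" guess: the last two labels of the host name.
    // This is wrong for ccTLDs like .co.uk; a real version would need a
    // public-suffix list.
    static String registeredDomain(String host) {
        String[] parts = host.split("\\.");
        if (parts.length <= 2) return host;
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }

    public static void main(String[] args) throws Exception {
        int threshold = 100;  // "more than 100 sub-domains", per the question below
        Map<String, HashSet<String>> subs = new HashMap<String, HashSet<String>>();

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String host;
        while ((host = in.readLine()) != null) {
            host = host.trim().toLowerCase();
            if (host.length() == 0) continue;
            String domain = registeredDomain(host);
            HashSet<String> set = subs.get(domain);
            if (set == null) {
                set = new HashSet<String>();
                subs.put(domain, set);
            }
            set.add(host);
        }
        in.close();

        // Print domains exceeding the sub-domain threshold: candidates for
        // manual review, and possibly for the regex-urlfilter file.
        for (Map.Entry<String, HashSet<String>> e : subs.entrySet()) {
            if (e.getValue().size() > threshold) {
                System.out.println(e.getKey() + "\t" + e.getValue().size());
            }
        }
    }
}

Since the registered-domain guess is naive, treat the output as a review list rather than an automatic blacklist.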

Besides, trying to pin this stuff down gets out of hand quickly. I spoke to someone who was complaining about managing 400+ web hosting accounts. It's tough to nail folks willing to go to that level.




Ken Krugler wrote:

We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.

I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.

Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?


I've read a paper on detecting link farms, but from what I remember, it wasn't a slam-dunk to implement.

So far we've relied on manually detecting these, and then pruning the results from the crawldb and adding them to the regex-urlfilter file.
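
For example, with the standard regex-urlfilter syntax (a leading '-' rejects, '+' accepts, and rules are applied in order with the first match deciding), the entries look something like this; the domain name here is just a placeholder, not a real farm:

# reject everything under a detected link-farm domain
-^http://([a-z0-9-]+\.)*link-farm-example\.com/
# accept anything else (the usual catch-all at the end of the file)
+.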

-- Ken


