I don't think it's a slam dunk either; even Google doesn't do a great
job of detecting these. I think a lot of it is still done manually.
I think you'd have to look at detecting closed networks, or mostly closed
networks, since a link farm would be relatively clustered from a link
perspective. As noted, that's not easy to implement, which is why people
working in SEO still use this technique to game the search engines.
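Just to illustrate the idea (none of this is a Nutch API -- the outlink map
and the 90% cutoff are assumptions on my part), the check boils down to
asking what fraction of a candidate group's outlinks stay inside the group:

import java.util.*;

// Rough sketch: flag a candidate set of hosts as a likely link farm when
// nearly all of its outlinks point back into the same set.
public class ClosedNetworkCheck {

    // outlinks maps each host to the hosts it links out to, gathered
    // however you like (e.g. from link data you've already crawled).
    static boolean looksLikeLinkFarm(Set<String> cluster,
                                     Map<String, List<String>> outlinks) {
        long total = 0, internal = 0;
        for (String host : cluster) {
            List<String> targets = outlinks.get(host);
            if (targets == null) continue;
            for (String target : targets) {
                total++;
                if (cluster.contains(target)) internal++;
            }
        }
        // A "mostly closed" network keeps nearly all of its links to itself;
        // 0.9 is an arbitrary cutoff.
        return total > 0 && (double) internal / total > 0.9;
    }
}

The hard part isn't this ratio, it's coming up with the candidate clusters
in the first place -- which is where it stops being easy.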
Besides, trying to pin this stuff down gets crazy fast. I spoke to
someone who was complaining about managing 400+ webhosting accounts;
it's tough to nail folks going to that level.
Ken Krugler wrote:
We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.
I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.
Are there any methods for detecting these things (more than 100
sub-domains), or a master list somewhere that we can filter against?
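The crudest thing I can come up with myself is counting distinct hosts per
domain and flagging anything over a cutoff. A rough sketch of that (the
naive "last two labels" domain parsing and the threshold are just guesses;
domains like .co.uk would really need a public-suffix list):

import java.util.*;

// Rough sketch: group crawled host names by parent domain and flag domains
// with an unusually large number of sub-domains.
public class SubDomainCounter {

    static List<String> flagDomains(Collection<String> hosts, int threshold) {
        Map<String, Set<String>> byDomain = new HashMap<String, Set<String>>();
        for (String host : hosts) {
            String[] labels = host.split("\\.");
            if (labels.length < 2) continue;
            // Naive parent domain = last two labels; wrong for .co.uk etc.
            String domain = labels[labels.length - 2] + "."
                          + labels[labels.length - 1];
            Set<String> subs = byDomain.get(domain);
            if (subs == null) {
                subs = new HashSet<String>();
                byDomain.put(domain, subs);
            }
            subs.add(host);
        }
        List<String> suspects = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> e : byDomain.entrySet()) {
            if (e.getValue().size() > threshold) suspects.add(e.getKey());
        }
        return suspects;
    }
}

Run over the list of hosts in the crawldb, that would at least surface the
worst offenders.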
I've read a paper on detecting link farms, but from what I remember,
it wasn't a slam-dunk to implement.
So far we've relied on detecting these manually, then pruning the offending
entries from the crawldb and adding them to the regex-urlfilter file.
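For what it's worth, the entries we end up adding are just exclusion rules
covering the whole offending domain, along these lines (the domain name
here is made up):

# exclude the link farm and all of its sub-domains
-^http://([a-z0-9-]+\.)*link-farm-example\.com/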
-- Ken