We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.

I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.

Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?

I've read a paper on detecting link farms, but from what I remember, it wasn't a slam-dunk to implement.

So far we've relied on manually detecting these, and then pruning the results from the crawldb and adding them to the regex-urlfilter file.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general

Reply via email to