Hi folks,
Offhand, I'm not aware of any slam-dunk solution to link farms
either. One thing that could help mitigate the problem is a pre-built
blacklist of some sort. For example:
http://www.squidguard.org/blacklist/
That one is really meant for blocking user access to porn, known
warez providers, etc., but it may have some value for you.
Another source of link farms is parked-domain providers. Many of
these can be identified by their DNS server name. Some of the top
offenders (afaik) include the following (a quick check against these
patterns is sketched after the list):
- dns(\d+).name-services.com
- ns(\d+).directnic.com
- ns(\d+).itsyourdomain.com
- park(\d+).secureserver.net
- ns.buydomains.com
- this-domain-for-sale.com
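For what it's worth, a check against those nameserver patterns could look
roughly like the sketch below. This is just a quick illustration (the class
name and structure are mine, not anything that ships with Nutch); you still
have to look up each domain's authoritative nameserver yourself, e.g. with
dig or a DNS library.

  import java.util.regex.Pattern;

  public class ParkedNsCheck {
      // Parked-domain nameserver patterns from the list above.
      private static final Pattern[] PARKED_NS = {
          Pattern.compile("dns\\d+\\.name-services\\.com"),
          Pattern.compile("ns\\d+\\.directnic\\.com"),
          Pattern.compile("ns\\d+\\.itsyourdomain\\.com"),
          Pattern.compile("park\\d+\\.secureserver\\.net"),
          Pattern.compile("ns\\.buydomains\\.com"),
          Pattern.compile("this-domain-for-sale\\.com")
      };

      // True if the nameserver host matches a known parked-domain provider.
      public static boolean isParkedNameserver(String nsHost) {
          String host = nsHost.toLowerCase();
          if (host.endsWith(".")) {           // dig output often has a trailing dot
              host = host.substring(0, host.length() - 1);
          }
          for (Pattern p : PARKED_NS) {
              if (p.matcher(host).matches()) {
                  return true;
              }
          }
          return false;
      }
  }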
A reasonable first pass at this list can be achieved by downloading the
Verisign COM zone file, counting the number of domains per DNS server,
and then checking the top 100 or so. (That's what I did, anyway! :)
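In case it's useful, the counting step can be done with something like the
sketch below. I'm assuming NS records appear one per line with an "NS" field
followed by the nameserver; the real zone file has a few more fields (TTL,
class), so you may need to adjust the parsing. Sort the output by count and
take the head to get your top-100 list.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.HashMap;
  import java.util.Map;

  public class NsCounter {
      public static void main(String[] args) throws Exception {
          Map<String, Integer> counts = new HashMap<String, Integer>();
          BufferedReader in = new BufferedReader(new FileReader(args[0]));
          String line;
          while ((line = in.readLine()) != null) {
              line = line.trim();
              // skip blank lines, comments, and directives
              if (line.length() == 0 || line.startsWith(";") || line.startsWith("$")) {
                  continue;
              }
              String[] fields = line.split("\\s+");
              // find the NS record type and take the next field as the nameserver
              for (int i = 1; i < fields.length - 1; i++) {
                  if (fields[i].equalsIgnoreCase("NS")) {
                      String ns = fields[i + 1].toLowerCase();
                      Integer c = counts.get(ns);
                      counts.put(ns, c == null ? 1 : c + 1);
                      break;
                  }
              }
          }
          in.close();
          // one "nameserver <tab> domain-count" line per nameserver
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
              System.out.println(e.getKey() + "\t" + e.getValue());
          }
      }
  }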
Rob, does that help you? Or are you hitting a different type of link
farm?
--Matt
On Mar 7, 2006, at 5:13 PM, Stefan Groschupf wrote:
Hi,
Is the content of the pages 'mostly' identical?
Since we can now provide custom hash implementations to the
crawlDB, what do people think about locality-sensitive hashing?
http://citeseer.ist.psu.edu/haveliwala00scalable.html
As far as I understand the paper, we could implement the hashing in
such a way that 'similar' pages (differing by just one word, say) are
treated as one.
My experience with link farms is that the pages are identical except
for one number, word, date, or something like that.
In such a case LSH could be an interesting way to attack the problem.
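Just to sketch the direction (this is my rough reading of the min-hashing
idea, not code from the paper; the shingle size, number of hash functions,
and the cheap hash mixing are arbitrary choices):

  import java.util.Arrays;
  import java.util.Random;

  public class MinHashSketch {
      private final long[] seeds;

      public MinHashSketch(int numHashes, long seed) {
          Random rnd = new Random(seed);
          seeds = new long[numHashes];
          for (int i = 0; i < numHashes; i++) {
              seeds[i] = rnd.nextLong();
          }
      }

      // Min-hash signature over word 4-shingles of the page text.
      // Pages that differ in only a word or number agree on most positions.
      public long[] signature(String text) {
          String[] words = text.toLowerCase().split("\\W+");
          long[] sig = new long[seeds.length];
          Arrays.fill(sig, Long.MAX_VALUE);
          for (int i = 0; i + 4 <= words.length; i++) {
              String shingle = words[i] + " " + words[i + 1] + " "
                             + words[i + 2] + " " + words[i + 3];
              for (int h = 0; h < seeds.length; h++) {
                  // cheap per-seed mixing of the shingle hash; a real
                  // implementation would use a proper hash family
                  long v = shingle.hashCode() * 0x9E3779B97F4A7C15L + seeds[h];
                  v ^= (v >>> 31);
                  if (v < sig[h]) {
                      sig[h] = v;
                  }
              }
          }
          return sig;
      }

      // Fraction of positions on which two signatures agree;
      // this estimates the Jaccard similarity of the shingle sets.
      public static double similarity(long[] a, long[] b) {
          int same = 0;
          for (int i = 0; i < a.length; i++) {
              if (a[i] == b[i]) same++;
          }
          return (double) same / a.length;
      }
  }

Two near-identical farm pages should then come out with a similarity close
to 1.0, so they can be bucketed or collapsed without a full page-by-page
comparison.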
Any thoughts?
Stefan
Am 07.03.2006 um 22:38 schrieb Ken Krugler:
We've managed to dig ourselves into a couple of link farms with tens
of thousands of sub-domains.
I didn't notice until they blocked our DNS requests and the Nutch
error rates shot way up.
Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?
I've read a paper on detecting link farms, but from what I
remember, it wasn't a slam-dunk to implement.
So far we've relied on manually detecting these, and then pruning
the results from the crawldb and adding them to the regex-urlfilter file.
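(An entry of that sort would look roughly like the following; the domain
here is made up:)

  # skip a detected link farm and all of its sub-domains
  -^http://([a-zA-Z0-9.-]+\.)?link-farm-example\.com/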
-- Ken
--
Matt Kangas / [EMAIL PROTECTED]