Andrzej Bialecki wrote:
Dennis Kubes wrote:
A while back we had some problems with the Visvo index. For instance,
if you did a search for "dallas", Google was returned first. This was
due to inbound link text: one link to Google had inbound link text
that said "dallas". My response (looking back now, not a very good
one) was to test the Wikia index without inbound link text.
This is not just a matter of links; it's also an issue of spam pages
linking to disreputable sites (as well as legitimate ones). In other
words, this is tied to the issue of spam detection and score propagation
(or score poisoning).
Yes, it is about spam domains, I agree. What I was saying is, take
domains like Google, Adobe Acrobat Reader, Flash, or StatCounter. These
domains have thousands of inbound links. We have a maximum number of
links that we store in the linkdb, and those get passed to the indexer.
For those types of pages we just store the first x links (at least I
think it is the first :)). We don't currently have a way of determining
which links are stored and which are discarded. Do you think this would
be a good place for an extension point, so people could have multiple
anchor filters?
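Something like this rough sketch is what I have in mind (AnchorFilter
and FirstNAnchorsFilter are made-up names for illustration, not an
existing Nutch extension point):

import java.util.List;

// Hypothetical extension point: given all inbound anchors for a page,
// decide which ones survive the linkdb cutoff.
public interface AnchorFilter {
  List<String> filter(String pageUrl, List<String> anchors, int maxAnchors);
}

// The current behavior (keep the first x anchors) expressed as one such filter.
class FirstNAnchorsFilter implements AnchorFilter {
  public List<String> filter(String pageUrl, List<String> anchors, int maxAnchors) {
    return anchors.subList(0, Math.min(maxAnchors, anchors.size()));
  }
}
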
I think the answer lies in finding the right links. I was going to
start with a filter that applies some type of similarity measure to the
anchors and orders them by most clustered or most similar, the idea
being that the truly relevant links will be the most numerous and the
most similar (who knows if that is true). The one thing I am worried
about with this is google-bombing. Any ideas on how they get around
that?
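Roughly what I mean, as a toy sketch (the token-set Jaccard similarity
here is just a placeholder for a real measure):

import java.util.*;

// Toy sketch: order anchors so the most "clustered" ones come first,
// using summed Jaccard similarity of token sets as a stand-in measure.
public class AnchorClusterRank {

  static Set<String> tokens(String anchor) {
    return new HashSet<String>(Arrays.asList(anchor.toLowerCase().split("\\s+")));
  }

  static double jaccard(Set<String> a, Set<String> b) {
    Set<String> inter = new HashSet<String>(a);
    inter.retainAll(b);
    Set<String> union = new HashSet<String>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }

  // Sort anchors by their total similarity to all other anchors, descending.
  static List<String> rankByClusterSimilarity(final List<String> anchors) {
    final int n = anchors.size();
    final double[] score = new double[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        if (i != j) score[i] += jaccard(tokens(anchors.get(i)), tokens(anchors.get(j)));
      }
    }
    Integer[] idx = new Integer[n];
    for (int i = 0; i < n; i++) idx[i] = i;
    Arrays.sort(idx, new Comparator<Integer>() {
      public int compare(Integer x, Integer y) { return Double.compare(score[y], score[x]); }
    });
    List<String> sorted = new ArrayList<String>();
    for (int i : idx) sorted.add(anchors.get(i));
    return sorted;
  }

  public static void main(String[] args) {
    List<String> anchors = Arrays.asList("adobe acrobat reader",
        "download acrobat reader", "acrobat reader", "dallas");
    // The shared "acrobat reader" anchors float to the top,
    // the lone "dallas" anchor sinks to the bottom.
    System.out.println(rankByClusterSimilarity(anchors));
  }
}
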
One thing that we already do in Nutch is that we take only a single
unique pair of <anchor, hostName> for every incoming link (see
Inlinks.getAnchors()). You could further reduce this by using the
domain name instead of the host name (many spam sites create millions
of meaningless subdomains).
Right, because Inlinks is backed by a HashSet. That is an interesting
idea for the domains.
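For example, something along these lines (just a sketch; getDomain
below is a naive last-two-labels stand-in, a proper version needs a
public-suffix list for things like .co.uk):

import java.net.URL;
import java.util.*;

// Sketch of collapsing inlinks to one <anchor, domain> pair instead of
// one <anchor, host> pair.
public class DomainDedup {

  static String getDomain(String host) {
    String[] parts = host.split("\\.");
    if (parts.length <= 2) return host;
    return parts[parts.length - 2] + "." + parts[parts.length - 1];
  }

  // anchorsByUrl: inlink URL -> anchor text; keeps one anchor per <anchor, domain>.
  static List<String> dedupByDomain(Map<String, String> anchorsByUrl) throws Exception {
    Set<String> seen = new HashSet<String>();   // "anchor\tdomain" pairs already kept
    List<String> kept = new ArrayList<String>();
    for (Map.Entry<String, String> e : anchorsByUrl.entrySet()) {
      String domain = getDomain(new URL(e.getKey()).getHost());
      String anchor = e.getValue().trim().toLowerCase();
      if (seen.add(anchor + "\t" + domain)) {
        kept.add(e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) throws Exception {
    Map<String, String> inlinks = new LinkedHashMap<String, String>();
    inlinks.put("http://a.spamsite.com/1", "cheap stuff");
    inlinks.put("http://b.spamsite.com/2", "cheap stuff");   // same anchor, same domain -> dropped
    inlinks.put("http://example.org/page", "cheap stuff");   // same anchor, other domain -> kept
    System.out.println(dedupByDomain(inlinks));              // two anchors kept, one dropped
  }
}
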
Other techniques would require changes to the current LinkDb format.
Assuming you can tag individual spam pages (or spam sites, using a
host-level db of spam features - see here:
http://www.nabble.com/Filter-spam-URLs-to14204931.html#a14212919), you
could use this information to carry a "spamminess" score attached to
the inlinks. That way you could ignore inlinks coming from dubious
sites.
What we have been talking about with Grub, instead of just a crawler,
is creating a type of WebDB using HBase. This would store all kinds of
metadata about a page that users can add, such as duplicate domains,
spamminess of pages, owners, all that type of stuff. Then this could be
brought back into Nutch for the spam calculations. Of course, anybody
will be able to add to and download this information.
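To make that concrete, here is the kind of calculation I mean (purely
hypothetical names; the spamScores map stands in for a lookup against
the shared metadata store or an extended LinkDb record):

import java.net.URL;
import java.util.*;

// Drop inlinks whose source host carries a high "spamminess" score.
public class SpamAwareInlinks {

  // Keep only inlinks whose host scores below the spam threshold.
  static Map<String, String> filterInlinks(Map<String, String> anchorsByUrl,
                                           Map<String, Float> spamScores,
                                           float threshold) throws Exception {
    Map<String, String> kept = new LinkedHashMap<String, String>();
    for (Map.Entry<String, String> e : anchorsByUrl.entrySet()) {
      String host = new URL(e.getKey()).getHost();
      Float spam = spamScores.get(host);          // unknown hosts assumed clean
      if (spam == null || spam < threshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) throws Exception {
    Map<String, String> inlinks = new LinkedHashMap<String, String>();
    inlinks.put("http://www.example.org/review", "acrobat reader");
    inlinks.put("http://free-stuff.spamhost.biz/x", "acrobat reader dallas");

    Map<String, Float> spamScores = new HashMap<String, Float>();  // from the metadata store
    spamScores.put("free-stuff.spamhost.biz", 0.92f);

    System.out.println(filterInlinks(inlinks, spamScores, 0.5f).keySet());
    // only http://www.example.org/review survives
  }
}
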
Re: google-bombing: I think the way they fix it is case by case ...
usually there's a big fuss about the google bomb du jour, and some time
later the same trick stops working. It could be an algorithmic
improvement - but sets of manually added rules are also a kind of
algorithm .. ;)
True, it is not hugely important and could be fixed case by case,
considering how long it would take for this to propagate.
Thinking about a good-enough heuristic to detect this sort of stuff (at
least to alert the human operator about a possible problem): if there is
a large number of anchors with the same terms, and they point to a page
that doesn't contain such terms (or hyponyms of anchor terms), then it's
likely a google bomb. What do you think?
That is exactly what I was thinking: if a large number of anchors
share the same terms and those terms don't match the page content, then
ignore those links.
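A quick sketch of the kind of check I mean (thresholds are made up,
and a real version would want stemming and the hyponym lookup you
mention):

import java.util.*;

// Flag anchor terms that show up in many anchors but never in the page
// content - a possible google bomb.
public class GoogleBombCheck {

  static Set<String> tokens(String text) {
    return new HashSet<String>(Arrays.asList(text.toLowerCase().split("\\W+")));
  }

  // Terms occurring in at least minAnchors anchors but not in the page text.
  static Set<String> suspiciousTerms(List<String> anchors, String pageText, int minAnchors) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String anchor : anchors) {
      for (String term : tokens(anchor)) {
        Integer c = counts.get(term);
        counts.put(term, c == null ? 1 : c + 1);
      }
    }
    Set<String> pageTerms = tokens(pageText);
    Set<String> suspicious = new HashSet<String>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= minAnchors && !pageTerms.contains(e.getKey())) {
        suspicious.add(e.getKey());
      }
    }
    return suspicious;
  }

  public static void main(String[] args) {
    List<String> anchors = Arrays.asList("miserable failure",
        "miserable failure", "miserable failure", "official biography");
    String pageText = "Official biography of the president ...";
    // flags "miserable" and "failure" as terms to ignore (or to alert on)
    System.out.println(suspiciousTerms(anchors, pageText, 3));
  }
}
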
Dennis Kubes