Dennis Kubes wrote:
A while back we had some problems with the Visvo index. For instance, if you searched for "dallas", Google was returned first. This was caused by inbound link text: one link to Google had anchor text that said "dallas". My response (looking back, not a very good one) was to test the Wikia index without inbound link text.
This is not just a matter of links; it's also an issue of spam pages linking to disreputable sites (as well as legitimate ones). In other words, this is tied to the issue of spam detection and score propagation (or score poisoning).
I think the answer lies in finding the right links. I was going to start with a filter that did some type of similarity measure on the links and ordered them by the most clustered or most similar, the idea being that the truly relevant links will be the most populous and most similar (who knows if that is true). The one thing I am worried about with this is googlebombing. Any ideas on how they get around that?
One thing that we already do in Nutch is that we take only a single unique pair of <anchor, hostName> for every incoming link (see Inlinks.getAnchors()). You could further reduce this by using a domain name instead of host name (many spam sites create millions of meaningless subdomains).
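To make the deduplication idea concrete, here is a minimal sketch in Java. It is not the actual Nutch code: the Link class, the naiveDomain() helper and the byDomain switch are hypothetical names used only for illustration. The domain guess simply keeps the last two host labels, which is wrong for suffixes such as co.uk; a real implementation would need a public-suffix list.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnchorDedup {
    /** Hypothetical stand-in for an inlink: source URL plus anchor text. */
    static final class Link {
        final String fromUrl;
        final String anchor;

        Link(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    /** Naive domain guess (assumption): keep only the last two host labels. */
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /** Returns one anchor per unique (anchor, host) or (anchor, domain) pair. */
    static List<String> uniqueAnchors(List<Link> links, boolean byDomain) {
        Set<String> seen = new HashSet<>();
        List<String> anchors = new ArrayList<>();
        for (Link link : links) {
            String host;
            try {
                host = URI.create(link.fromUrl).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip malformed source URLs
            }
            if (host == null || link.anchor == null) {
                continue;
            }
            String anchor = link.anchor.trim().toLowerCase();
            if (anchor.isEmpty()) {
                continue;
            }
            String source = byDomain ? naiveDomain(host.toLowerCase()) : host.toLowerCase();
            // Keep the anchor only the first time this (anchor, source) pair is seen.
            if (seen.add(anchor + '\u0000' + source)) {
                anchors.add(anchor);
            }
        }
        return anchors;
    }
}
```

With byDomain set to true, a thousand subdomains of one spam domain that all use the same anchor text contribute that anchor only once.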
Other techniques would require changes in the current format of LinkDb. Assuming you can tag individual spam pages (or spam sites using host-level db of spam features - see here: http://www.nabble.com/Filter-spam-URLs-to14204931.html#a14212919), you could use this information to carry the "spamminess" score attached to inlinks. This way you could ignore inlinks coming from dubious sites.
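A rough sketch of what ignoring dubious inlinks could look like, assuming each inlink carried a spam score. The current LinkDb format has no such field, so the ScoredInlink class, its spamScore field (0.0 = clean, 1.0 = certainly spam) and the maxSpamScore threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SpamAwareInlinks {
    /** Hypothetical inlink record that carries a spam score for its source. */
    static final class ScoredInlink {
        final String fromUrl;
        final String anchor;
        final float spamScore; // assumed range: 0.0 (clean) to 1.0 (spam)

        ScoredInlink(String fromUrl, String anchor, float spamScore) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
            this.spamScore = spamScore;
        }
    }

    /** Keeps only anchors whose source scores at or below the threshold. */
    static List<String> trustedAnchors(List<ScoredInlink> inlinks, float maxSpamScore) {
        List<String> anchors = new ArrayList<>();
        for (ScoredInlink inlink : inlinks) {
            if (inlink.spamScore <= maxSpamScore) {
                anchors.add(inlink.anchor);
            }
        }
        return anchors;
    }
}
```

A hard cutoff is the simplest choice; the same score could instead be used to down-weight anchors rather than drop them outright.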
Re: google-bombing: I think the way they fix it is case-by-case ... usually there's a big fuss about a google bomb du jour, and some time later the same trick stops working. It could be an algorithmic improvement - but sets of manually added rules are also a kind of algorithm .. ;)
Thinking about a good-enough heuristic to detect this sort of stuff (at least to alert the human operator about a possible problem): if there is a large number of anchors with the same terms, and they point to a page that doesn't contain such terms (or hyponyms of anchor terms), then it's likely a google bomb. What do you think?
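A minimal sketch of this heuristic, under a few assumptions: the anchors are already deduplicated per source, tokenization is a plain split on non-alphanumeric characters, and the hyponym check is left out because it would need a thesaurus. The class name, method names and the minAnchors threshold are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class GoogleBombHeuristic {
    /** Lower-cases the text and splits it into a set of alphanumeric terms. */
    static Set<String> tokenize(String text) {
        Set<String> terms = new HashSet<>();
        for (String term : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /**
     * Returns terms that occur in at least minAnchors anchors but never
     * in the target page text. A non-empty result is only a hint that a
     * human operator should take a look.
     */
    static Set<String> suspiciousTerms(List<String> anchors, String pageText, int minAnchors) {
        // Count, for each term, how many anchors contain it.
        Map<String, Integer> anchorCounts = new HashMap<>();
        for (String anchor : anchors) {
            for (String term : tokenize(anchor)) {
                anchorCounts.merge(term, 1, Integer::sum);
            }
        }
        Set<String> pageTerms = tokenize(pageText);
        Set<String> suspicious = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : anchorCounts.entrySet()) {
            if (entry.getValue() >= minAnchors && !pageTerms.contains(entry.getKey())) {
                suspicious.add(entry.getKey());
            }
        }
        return suspicious;
    }
}
```

Pages that are legitimately described by terms they never use themselves (a logo-only home page, for example) would also be flagged, which is one more reason to treat the output as an alert rather than an automatic penalty.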
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
