Andrzej Bialecki wrote:
Dennis Kubes wrote:
A while back we had some problems with the Visvo index. For instance,
if you did a search for "dallas", Google was returned first. This was
due to inbound link text: one link to Google had inbound link text
that said "dallas". My response (looking back now, not a very good
one) was to test the Wikia index without inbound link text.
This is not just a matter of links; it's also an issue of spam pages
linking to disreputable sites (as well as legitimate ones). In other
words, this is tied to the issue of spam detection and score propagation
(or score poisoning).
Yes, it is about spam domains, I agree. What I was saying is, take
domains like Google, Adobe Acrobat Reader, Flash, or StatCounter. These
domains have thousands of inbound links. We have a maximum number of
links that we store in the linkdb, and those get passed to the indexer.
For those types of pages we just store the first x links (at least I
think it is the first :)). We don't currently have a way of determining
which links are stored and which are discarded. Do you think this would
be a good place for an extension point, so people could have multiple
anchor filters?
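Something like this rough sketch is what I have in mind (AnchorFilter
and FirstNAnchorsFilter are made-up names for illustration, not an
existing Nutch extension point):

import java.util.List;

// Hypothetical extension point: given all inbound anchors for a page,
// decide which ones survive the linkdb cutoff.
public interface AnchorFilter {
  List<String> filter(String pageUrl, List<String> anchors, int maxAnchors);
}

// The current behavior (keep the first x anchors) expressed as one such filter.
class FirstNAnchorsFilter implements AnchorFilter {
  public List<String> filter(String pageUrl, List<String> anchors, int maxAnchors) {
    return anchors.subList(0, Math.min(maxAnchors, anchors.size()));
  }
}
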
I think the answer lies in finding the right links. I was going to
start with a filter that applies some type of similarity measure to the
anchors and orders them by most clustered or most similar, the idea
being that the truly relevant links will be the most numerous and the
most similar (who knows if that is true). The one thing I am worried
about with this is google-bombing. Any ideas on how they get around
that?
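Roughly what I mean, as a toy sketch (the token-set Jaccard similarity
here is just a placeholder for a real measure):

import java.util.*;

// Toy sketch: order anchors so the most "clustered" ones come first,
// using summed Jaccard similarity of token sets as a stand-in measure.
public class AnchorClusterRank {

  static Set<String> tokens(String anchor) {
    return new HashSet<String>(Arrays.asList(anchor.toLowerCase().split("\\s+")));
  }

  static double jaccard(Set<String> a, Set<String> b) {
    Set<String> inter = new HashSet<String>(a);
    inter.retainAll(b);
    Set<String> union = new HashSet<String>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }

  // Sort anchors by their total similarity to all other anchors, descending.
  static List<String> rankByClusterSimilarity(final List<String> anchors) {
    final int n = anchors.size();
    final double[] score = new double[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        if (i != j) score[i] += jaccard(tokens(anchors.get(i)), tokens(anchors.get(j)));
      }
    }
    Integer[] idx = new Integer[n];
    for (int i = 0; i < n; i++) idx[i] = i;
    Arrays.sort(idx, new Comparator<Integer>() {
      public int compare(Integer x, Integer y) { return Double.compare(score[y], score[x]); }
    });
    List<String> sorted = new ArrayList<String>();
    for (int i : idx) sorted.add(anchors.get(i));
    return sorted;
  }

  public static void main(String[] args) {
    List<String> anchors = Arrays.asList("adobe acrobat reader",
        "download acrobat reader", "acrobat reader", "dallas");
    // The shared "acrobat reader" anchors float to the top,
    // the lone "dallas" anchor sinks to the bottom.
    System.out.println(rankByClusterSimilarity(anchors));
  }
}
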
One thing that we already do in Nutch is that we take only a single
unique pair of <anchor, hostName> for every incoming link (see
Inlinks.getAnchors()). You could further reduce this by using the
domain name instead of the host name (many spam sites create millions
of meaningless subdomains).
Right, because Inlinks is backed by a HashSet. That is an interesting
idea for the domains.
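For example, something along these lines (just a sketch; getDomain
below is a naive last-two-labels stand-in, a proper version needs a
public-suffix list for things like .co.uk):

import java.net.URL;
import java.util.*;

// Sketch of collapsing inlinks to one <anchor, domain> pair instead of
// one <anchor, host> pair.
public class DomainDedup {

  static String getDomain(String host) {
    String[] parts = host.split("\\.");
    if (parts.length <= 2) return host;
    return parts[parts.length - 2] + "." + parts[parts.length - 1];
  }

  // anchorsByUrl: inlink URL -> anchor text; keeps one anchor per <anchor, domain>.
  static List<String> dedupByDomain(Map<String, String> anchorsByUrl) throws Exception {
    Set<String> seen = new HashSet<String>();   // "anchor\tdomain" pairs already kept
    List<String> kept = new ArrayList<String>();
    for (Map.Entry<String, String> e : anchorsByUrl.entrySet()) {
      String domain = getDomain(new URL(e.getKey()).getHost());
      String anchor = e.getValue().trim().toLowerCase();
      if (seen.add(anchor + "\t" + domain)) {
        kept.add(e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) throws Exception {
    Map<String, String> inlinks = new LinkedHashMap<String, String>();
    inlinks.put("http://a.spamsite.com/1", "cheap stuff");
    inlinks.put("http://b.spamsite.com/2", "cheap stuff");   // same anchor, same domain -> dropped
    inlinks.put("http://example.org/page", "cheap stuff");   // same anchor, other domain -> kept
    System.out.println(dedupByDomain(inlinks));              // two anchors kept, one dropped
  }
}
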
Other techniques would require changes to the current LinkDb format.
Assuming you can tag individual spam pages (or spam sites, using a
host-level db of spam features - see here:
http://www.nabble.com/Filter-spam-URLs-to14204931.html#a14212919), you
could use this information to carry a "spamminess" score attached to
the inlinks. That way you could ignore inlinks coming from dubious
sites.
What we have been talking about with Grub, instead of just a crawler,
is creating a type of WebDB using HBase. This would store all kinds of
metadata about a page that users can add, such as duplicate domains,
spamminess of pages, owners, all that type of stuff. Then this could be
brought back into Nutch for the spam calculations. Of course, anybody
will be able to add to and download this information.
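To make that concrete, here is the kind of calculation I mean (purely
hypothetical names; the spamScores map stands in for a lookup against
the shared metadata store or an extended LinkDb record):

import java.net.URL;
import java.util.*;

// Drop inlinks whose source host carries a high "spamminess" score.
public class SpamAwareInlinks {

  // Keep only inlinks whose host scores below the spam threshold.
  static Map<String, String> filterInlinks(Map<String, String> anchorsByUrl,
                                           Map<String, Float> spamScores,
                                           float threshold) throws Exception {
    Map<String, String> kept = new LinkedHashMap<String, String>();
    for (Map.Entry<String, String> e : anchorsByUrl.entrySet()) {
      String host = new URL(e.getKey()).getHost();
      Float spam = spamScores.get(host);          // unknown hosts assumed clean
      if (spam == null || spam < threshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) throws Exception {
    Map<String, String> inlinks = new LinkedHashMap<String, String>();
    inlinks.put("http://www.example.org/review", "acrobat reader");
    inlinks.put("http://free-stuff.spamhost.biz/x", "acrobat reader dallas");

    Map<String, Float> spamScores = new HashMap<String, Float>();  // from the metadata store
    spamScores.put("free-stuff.spamhost.biz", 0.92f);

    System.out.println(filterInlinks(inlinks, spamScores, 0.5f).keySet());
    // only http://www.example.org/review survives
  }
}
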
Re: google-bombing: I think the way they fix it is case by case ...
usually there's a big fuss about the google bomb du jour, and some time
later the same trick stops working. It could be an algorithmic
improvement - but sets of manually added rules are also a kind of
algorithm .. ;)
True, it is not hugely important and could be fixed case by case,
considering how long it would take for this to propagate.
Thinking about a good-enough heuristic to detect this sort of stuff (at
least to alert the human operator about a possible problem): if there is
a large number of anchors with the same terms, and they point to a page
that doesn't contain such terms (or hyponyms of anchor terms), then it's
likely a google bomb. What do you think?
That is exactly what I was thinking: if a large number of anchors
share the same terms and those terms don't match the page content, then
ignore those links.
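A quick sketch of the kind of check I mean (thresholds are made up,
and a real version would want stemming and the hyponym lookup you
mention):

import java.util.*;

// Flag anchor terms that show up in many anchors but never in the page
// content - a possible google bomb.
public class GoogleBombCheck {

  static Set<String> tokens(String text) {
    return new HashSet<String>(Arrays.asList(text.toLowerCase().split("\\W+")));
  }

  // Terms occurring in at least minAnchors anchors but not in the page text.
  static Set<String> suspiciousTerms(List<String> anchors, String pageText, int minAnchors) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String anchor : anchors) {
      for (String term : tokens(anchor)) {
        Integer c = counts.get(term);
        counts.put(term, c == null ? 1 : c + 1);
      }
    }
    Set<String> pageTerms = tokens(pageText);
    Set<String> suspicious = new HashSet<String>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= minAnchors && !pageTerms.contains(e.getKey())) {
        suspicious.add(e.getKey());
      }
    }
    return suspicious;
  }

  public static void main(String[] args) {
    List<String> anchors = Arrays.asList("miserable failure",
        "miserable failure", "miserable failure", "official biography");
    String pageText = "Official biography of the president ...";
    // flags "miserable" and "failure" as terms to ignore (or to alert on)
    System.out.println(suspiciousTerms(anchors, pageText, 3));
  }
}
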
Dennis Kubes