David Wallace wrote:
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code. I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.
Note that this is only done when collecting anchor texts, not when
computing page scores.
Suppose my site has 3 pages with links to page X, and the same anchor
text. I'd kind of like to score page X higher than a page where there's
only one incoming link with that anchor text. But I don't want to have
this effect swamping the other calculations of page score. In other
words, if my site has 1000 pages with links to page X, this page should
score a wee bit higher than a similar page with just one incoming link,
but not 1000 times higher.
I'm thinking of doing some maths with the number of repetitions of an
anchor, then including the result in the page score. Something like
log(10+n), or maybe n/(n+2), where n is the number of incoming links
with the same anchor text. Either of these formulas would make 1000
incoming links score roughly 3 times higher than a single incoming link,
which seems about right to me.
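
As a quick sanity check of the "roughly 3 times" figure, here is a minimal Python sketch of the two damping formulas. The function names are made up for illustration and are not part of Nutch. Base-10 log is an assumption, though the ratio between two log values comes out the same in any base.

```python
import math

def log_damping(n: int) -> float:
    """Hypothetical helper: log(10 + n), using base 10 (an assumption)."""
    return math.log10(10 + n)

def ratio_damping(n: int) -> float:
    """Hypothetical helper: n / (n + 2), which approaches 1 as n grows."""
    return n / (n + 2)

# Compare 1000 same-anchor incoming links against a single one.
for name, f in (("log(10+n)", log_damping), ("n/(n+2)", ratio_damping)):
    boost = f(1000) / f(1)
    print(f"{name}: f(1)={f(1):.3f}  f(1000)={f(1000):.3f}  ratio={boost:.2f}")
```

Both ratios land a little under 3. Note that n/(n+2) is bounded above by 1, whereas log(10+n) keeps growing (slowly) with n.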
In the Nutch trunk, page scores are currently sqrt(OPIC):
http://www.nabble.com/-Fwd%3A-Fetch-list-priority--t360125.html#a997304
The OPIC calculation does not consider the domain or anchor text.
Hope this helps.
Doug