> Nutch is assigning each page a "boost", which is per-page (not
> per-query) and I think it is somewhat analogous to Google's PageRank
> (though of course I'm sure not the same algorithm).
> Is there an exact definition of this boost anywhere, e.g. how it is
> calculated within nutch?

The only "definition" I've found is in the code :)
If you look at org.apache.nutch.indexer.IndexSegment.makeDocument,
you'll see the call to calculateBoost, which calculates a Lucene
document boost value using the page score, the indexer.score.power
configuration value, the indexer.boost.by.link.count configuration
boolean, and the number of inbound links.
The number of inbound links can only be accurately determined (based
on pages crawled, of course) via data from the WebDB, which is why
you'd want to run UpdateSegmentsFromDB before indexing pages, if
you've got indexer.boost.by.link.count set to true. Alternatively, do
your indexing after merging all of the segments.
How the page score gets calculated is another topic. I understand the
basic approach, which only relies on the injected link score and the
internal/external link score factors. But the "real" link analysis
algorithm could certainly use a write-up in the Wiki. The specific
question, in case Mike is reading, is how nextScore is used (for
linked-to pages that have outlinks) in the
DistributedAnalysisTool.computeRound() method.
Though maybe the mapred work means that code goes away.
In any case, if you just use default Nutch settings and don't run the
DistributedAnalysisTool, then all of the page scores are 1.0. So the
Lucene document boost winds up being ln(e + inbound link count): 0
inbound links gives 1.0, 10 links gives 2.54, 100 links gives 4.63, etc.
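
Here is a rough sketch of the calculation described above. It is not the actual Nutch source. The class name BoostSketch, the parameter names, and applying indexer.score.power as an exponent on the page score are my assumptions. Only the list of inputs and the ln(e + inbound link count) result for the default case come from the description above.

```java
import java.util.Locale;

// Hypothetical stand-in for IndexSegment.calculateBoost, for illustration only.
final class BoostSketch {

    static float calculateBoost(float pageScore, float scorePower,
                                boolean boostByLinkCount, int inboundLinks) {
        // Assumption: the score power is used as an exponent on the page score.
        float boost = (float) Math.pow(pageScore, scorePower);
        if (boostByLinkCount) {
            // Natural log of (e + link count), so zero inbound links leaves the boost unchanged.
            boost *= (float) Math.log(Math.E + inboundLinks);
        }
        return boost;
    }

    public static void main(String[] args) {
        // Page score 1.0 mirrors the case where DistributedAnalysisTool was never run.
        // The 0.5 power is an arbitrary value; it has no effect when the score is 1.0.
        for (int links : new int[] {0, 10, 100}) {
            float boost = calculateBoost(1.0f, 0.5f, true, links);
            System.out.println(String.format(Locale.ROOT,
                    "%d inbound links -> boost %.2f", links, boost));
        }
    }
}
```

With a page score of 1.0 the power term drops out, which is why the default-case boost depends only on the inbound link count.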
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200