Nutch assigns each page a "boost", which is per-page (not
per-query). I think it is somewhat analogous to Google's PageRank
(though I'm sure it's not the same algorithm).

Is there an exact definition of this boost anywhere, e.g. how it is
calculated within nutch?

The only "definition" I've found is in the code :)

If you look at org.apache.nutch.indexer.IndexSegment.makeDocument, you'll see the call to calculateBoost, which calculates a Lucene document boost value from four inputs: the page score, the indexer.score.power configuration value, the indexer.boost.by.link.count configuration boolean, and the number of inbound links.

The number of inbound links can only be accurately determined (based on pages crawled, of course) via data from the WebDB. That's why, if you've got indexer.boost.by.link.count set to true, you'd want to run UpdateSegmentsFromDB before indexing pages, or else do your indexing after merging all of the segments.

How the page score gets calculated is another topic. I understand the basic approach, which relies only on the injected link score and the internal/external link score factors. But the "real" link analysis algorithm could certainly use a write-up in the Wiki. The specific question, in case Mike is reading, is how nextScore is used (for linked-to pages that have outlinks) in the DistributedAnalysisTool.computeRound() method.

Though maybe the mapred work means that code goes away.

In any case, if you just use default Nutch settings and don't run the DistributedAnalysisTool, then all of the page scores are 1.0. So the Lucene document boost winds up being ln(e + inbound link count): 0 inbound links gives 1.0, 10 links gives 2.54, 100 links gives 4.63, etc.
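
To make the arithmetic concrete, here's a rough sketch of that boost calculation in Java. It's a guess at the shape, not a copy of the Nutch source: the class name, the method signature, and the way the page score and indexer.score.power combine (score raised to that power) are all assumptions, and so is the 0.5 power value used below. The only part taken from the numbers above is the ln(e + inbound link count) factor.

```java
// Hypothetical stand-in for IndexSegment.calculateBoost; the real
// signature and formula in Nutch may differ.
final class BoostSketch {

    static float calculateBoost(float pageScore, float scorePower,
                                boolean boostByLinkCount, int inboundLinks) {
        // Assumption: the page score is raised to indexer.score.power.
        float boost = (float) Math.pow(pageScore, scorePower);
        if (boostByLinkCount) {
            // ln(e + n): 0 inbound links leaves the boost unchanged (ln(e) = 1).
            boost *= (float) Math.log(Math.E + inboundLinks);
        }
        return boost;
    }

    public static void main(String[] args) {
        // Page score 1.0, as when DistributedAnalysisTool has not been run.
        // The 0.5 power is an assumed value; with a score of 1.0 it has no effect.
        for (int links : new int[] {0, 10, 100}) {
            System.out.printf("%d inbound links -> boost %.2f%n",
                    links, calculateBoost(1.0f, 0.5f, true, links));
        }
    }
}
```

With a page score of 1.0 the power term drops out whatever its value, which is why only the link count matters in the default case.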

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
