 > also how does it keep track of incoming links globally on these pages, if
 > the weight is determined by # of incoming links then there would have to be
 > somewhere it keeps track so when you split your indexes it can still have an
 > accurate value for the distributed search?

The WebDB keeps track of this info. It's not in the segments/indexes.

 > at which step does nutch figure out the weight of each page, the updatedb
 > step? or the index step?

The updatedb step.

In UpdateDatabaseTool.java's PageContentChanged() method, all of the outlink URLs are first harvested from the fetched page. A score is then calculated for each page those outlink URLs reference: the fetched page's score multiplied by either the internal or the external link weight, depending on whether the URL is in the same domain as the fetched page. Both weights come from the Nutch config XML data and are 1.0 by default.

When you inject URLs, there is no referring page to take a score from, so Nutch arbitrarily uses the db.score.injected value (1.0 by default).

So if you leave everything set to default values, and don't perform link analysis, I think every page will wind up with a score of 1.0.
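
To make the arithmetic concrete, here is a rough sketch of that calculation. It is not the actual UpdateDatabaseTool code: the class and method names are made up for illustration, and comparing URL hosts is only an assumed stand-in for however Nutch really decides that two URLs are in the same domain.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Nutch.
public class LinkScoreSketch {

    // Assumed to mirror the default db.score.injected value of 1.0.
    static final float INJECTED_SCORE = 1.0f;

    // Simplification: treat "same domain" as "same host".
    static boolean sameHost(String a, String b) {
        String hostA = URI.create(a).getHost();
        String hostB = URI.create(b).getHost();
        return hostA != null && hostA.equalsIgnoreCase(hostB);
    }

    // Score handed to one outlink target: the fetched page's score times
    // the internal or external link weight.
    static float outlinkScore(String fromUrl, float fromScore, String toUrl,
                              float internalWeight, float externalWeight) {
        float weight = sameHost(fromUrl, toUrl) ? internalWeight : externalWeight;
        return fromScore * weight;
    }

    // Scores every outlink harvested from one fetched page.
    static Map<String, Float> scoreOutlinks(String fromUrl, float fromScore,
                                            List<String> outlinks,
                                            float internalWeight,
                                            float externalWeight) {
        Map<String, Float> scores = new LinkedHashMap<>();
        for (String target : outlinks) {
            scores.put(target, outlinkScore(fromUrl, fromScore, target,
                                            internalWeight, externalWeight));
        }
        return scores;
    }

    public static void main(String[] args) {
        // An injected page starts at INJECTED_SCORE; both weights are left at 1.0.
        List<String> outlinks = Arrays.asList(
                "http://example.com/about",
                "http://other.example.org/");
        Map<String, Float> scores = scoreOutlinks(
                "http://example.com/", INJECTED_SCORE, outlinks, 1.0f, 1.0f);
        for (Map.Entry<String, Float> entry : scores.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

With both weights at 1.0 the multiplication is a no-op, so the injected score of 1.0 is simply copied from page to page. Scores only start to differ once one of the weights or the injected score is changed.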

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
