Hi all,

Score calculations in trunk/ follow a different model than in 0.7.x. I'd like to start a discussion on this - it was a significant change, and my feeling is that its consequences are not completely clear. I have some doubts myself, and I'd like to be sure what is going on.

A quick background: where Nutch 0.7.x used a variant of PageRank, Nutch 0.8 uses the so-called "On-line Page Importance Computation" (OPIC) method. The OPIC score calculation is based on the idea that each page is initially assigned a certain "cash" value, and it distributes parts of this cash to its outgoing links. At the same time, the page receives cash from any incoming links pointing to it. This way we can adjust the page importance based on the "quality" of pages linking to and from it.
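To make the "cash" metaphor concrete, here's a toy illustration (plain Java, not Nutch code - all names are made up for the example):

     import java.util.HashMap;
     import java.util.Map;

     // Toy OPIC step: a page with cash 1.0 and 4 outlinks passes
     // 1.0 / 4 = 0.25 to each target page.
     public class OpicToy {
       public static void main(String[] args) {
         float cash = 1.0f;                          // initial cash of page A
         String[] outlinks = { "B", "C", "D", "E" };
         float perOutlink = cash / outlinks.length;  // 0.25 per target

         Map<String, Float> received = new HashMap<String, Float>();
         for (String target : outlinks) {
           Float prev = received.get(target);        // accumulate per target
           received.put(target, (prev == null ? 0.0f : prev) + perOutlink);
         }
         System.out.println(received);               // B..E each get 0.25
       }
     }

If page B in turn has two outlinks, it would pass 0.25 / 2 = 0.125 to each of them when it is fetched, and so on.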

In terms of the current Nutch implementation:

* Injector: records the initial cash value of a page when it's injected into CrawlDB. The default value is implicitly set to 1.0f (see the declaration of "private float score" in CrawlDatum) - not too smart, IMO; this value should be set explicitly by Injector, based on a config property or a command-line argument.

* Generator: selects some pages (i.e. CrawlDatum with the initial score) and puts them in crawl_generate.

* Fetcher: processes these CrawlDatum-s and passes their initial "cash values" down, to be recorded in crawl_parse.

* Parsing (invoked either from Fetcher or from ParseSegment) discovers outlinks, and records them using ParseOutputFormat, where each outlink ends up under the outlink's target URL in crawl_parse (as CrawlDatum.LINKED). It also sets their scores to a fraction of the originating page's score (i.e. outlink_score = orig_page_score / num_outlinks).

* CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s from crawl_parse with the same URL, which means that we get:

   * the original CrawlDatum,
   * optionally, a CrawlDatum that contains just a Signature,
   * all CrawlDatum.LINKED entries pointing to our URL, generated by outlinks from other pages.

Based on this information, a new score is calculated by adding the original score and all scores from incoming links (see the sketch below).
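
For reference, the relevant part of CrawlDBReducer boils down to something like this (a simplified sketch, not the literal trunk code; variable names are mine):

     // 'result' starts from the CrawlDatum recorded in the segment; every
     // CrawlDatum.LINKED entry for this URL contributes its cash.
     float scoreIncrement = 0.0f;
     while (values.hasNext()) {
       CrawlDatum datum = (CrawlDatum) values.next();
       if (datum.getStatus() == CrawlDatum.STATUS_LINKED) {
         scoreIncrement += datum.getScore();  // cash passed via one inlink
       }
       // ... fetch status and signature handling omitted ...
     }
     result.setScore(result.getScore() + scoreIncrement);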

HOWEVER... and here's where I suspect the current code is wrong: since we are processing just one segment, the incoming link information is very incomplete - it comes only from the outlinks discovered by fetching this segment's fetchlist, and not from the complete LinkDB.

One mitigating factor could be that we already accounted for incoming links from other segments when processing those segments - so our initial score already includes the inlink information from them. But this assumes that we never generate and process more than one segment in parallel, i.e. that we finish updating from all previous segments before we update from the current one (otherwise we wouldn't know the updated initial score).

Also, the "cash value" of those outlinks that point to URLs not in the current fetchlist will be dropped, because it won't be collected anywhere: LinkDB doesn't store the link "cash values", and they are not stored in CrawlDB either.

I think a better option would be to add the LinkDB as an input dir to CrawlDB.update(), so that we have access to all previously collected inlinks. The problem, however, is that we don't keep the contributing "cash values" per inlink... but we could add this value to each Inlink when we run LinkDB.invertlinks().
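
Roughly, I mean something like this in the job setup of CrawlDB.update() (just a sketch, untested - path constants and API calls assumed to match what's in trunk):

     // make the reducer see the complete inlink information by adding
     // the LinkDB as another input to the update job
     job.addInputPath(new Path(crawlDb, CrawlDb.CURRENT_NAME));       // existing entries
     job.addInputPath(new Path(segment, CrawlDatum.PARSE_DIR_NAME));  // this segment's crawl_parse
     job.addInputPath(new Path(linkDb, LinkDb.CURRENT_NAME));         // proposed addition

Each Inlink coming from LinkDB would then carry the cash value it contributed, recorded at invertlinks() time.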

And a final note: CrawlDB.update() uses the initial score value recorded in the segment, and NOT the value actually found in CrawlDB at the time of the update. This means that if there was another update in the meantime, your new score in CrawlDB will be overwritten with a score based on an older initial value. This is counter-intuitive - I think CrawlDB.update() should always use the latest score value found in the current CrawlDB. I.e. in CrawlDBReducer, instead of doing:

     result.setScore(result.getScore() + scoreIncrement);

we should do:

     result.setScore(old.getScore() + scoreIncrement);
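
where 'old' would be the CrawlDatum read from the current CrawlDB (again a sketch with assumed variable names; a newly discovered page has no previous CrawlDB entry, so we'd need a fallback):

     if (old != null) {
       // base the new score on the freshest value known to CrawlDB
       result.setScore(old.getScore() + scoreIncrement);
     } else {
       // newly discovered page - no previous CrawlDB entry to consult
       result.setScore(result.getScore() + scoreIncrement);
     }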

-------------

I hope all this is not too confusing ... I hope I'm not the one who is hopelessly confused ;-)

Any comments, corrections and suggestions appreciated!

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

