(Better late than never... I realized I hadn't yet responded to your posting.)
Doug Cutting wrote:
> I think all that you're saying is that we should not run two CrawlDb
> updates at once, right? But there are lots of reasons we cannot do
> that besides the OPIC calculation.
When we used WebDB it was possible to overlap generate / fetch / update
cycles, because we would "lock" pages selected by FetchListTool for a
period of time. Now we don't do this. The advantage is that we don't
have to rewrite CrawlDb. But operations on CrawlDb are considerably
faster than on WebDB, so perhaps we should consider going back to this
method?
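A minimal sketch of that locking idea, assuming a per-URL lock expiry kept alongside the page state (the class and field names here are illustrative, not the actual WebDB/FetchListTool API):

```java
// Illustrative sketch only: a generate run "locks" the pages it selects
// until a deadline, so an overlapping generate run skips them. The names
// (PageLock, LOCK_PERIOD_MS) are hypothetical, not real Nutch classes.
class PageLock {
    static final long LOCK_PERIOD_MS = 7L * 24 * 60 * 60 * 1000; // one week

    private long lockedUntil = 0L; // 0 means not locked

    // Called by the generator when it selects this page for a fetchlist.
    void lock(long now) {
        lockedUntil = now + LOCK_PERIOD_MS;
    }

    // An overlapping generate run would skip pages that are still locked.
    boolean isLocked(long now) {
        return now < lockedUntil;
    }

    // Called by the updater once the fetch result has been merged back.
    void unlock() {
        lockedUntil = 0L;
    }
}
```

With something like this, overlapping cycles become safe because a second generate run cannot hand out pages that an in-flight fetch already owns.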
>> Also, the "cash value" of those outlinks that point to URLs not in
>> the current fetchlist will be dropped, because they won't be
>> collected anywhere.
> No, every cash value is used. The input to a crawl db update includes a
> CrawlDatum for every known url, including those just linked to. If
> the only CrawlDatum for a url is a new outlink from a page crawled,
> then the score for the page is 1.0 + the score of that outlink.
Of course, you are right, I missed this.
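To make the accounting above concrete, here is a hedged sketch of the rule Doug describes: all outlink cash pointing at a URL is summed during the update, and a URL not yet in CrawlDb starts from the default score of 1.0. The names below are illustrative, not the actual Nutch API:

```java
// Sketch of the scoring rule described above: a URL whose only
// CrawlDatum records are fresh outlinks starts at the default score
// (1.0) plus the sum of the cash carried by those outlinks.
class NewlyLinkedScore {
    static final float DEFAULT_SCORE = 1.0f;

    // Score for a page first seen only as an outlink target.
    static float score(float[] outlinkCash) {
        float s = DEFAULT_SCORE;
        for (float c : outlinkCash) {
            s += c;
        }
        return s;
    }
}
```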
>> And a final note: CrawlDb.update() uses the initial score value
>> recorded in the segment, and NOT the value that is actually found in
>> CrawlDb at the time of the update. This means that if there was
>> another update in the meantime, your new score in CrawlDb will be
>> overwritten with a score based on an older initial value. This is
>> counter-intuitive - I think CrawlDb.update() should always use the
>> latest score value found in the current CrawlDb. I.e. in
>> CrawlDbReducer, instead of doing:
>>
>>   result.setScore(result.getScore() + scoreIncrement);
>>
>> we should do:
>>
>>   result.setScore(old.getScore() + scoreIncrement);
> The change is not quite that simple, since 'old' is sometimes null.
> So perhaps we need to add a 'score' variable that is set to old.score
> when old != null, and to 1.0 otherwise (for newly linked pages).
> The reason I didn't do it that way was to permit the Fetcher to modify
> scores, since I was thinking of the Fetcher as the actor whose actions
> are being processed here, and of the CrawlDb as the passive thing
> acted on. But indeed, if you have another process that's updating a
> CrawlDb while a Fetcher is running, this may not be the case. So if
> we want to switch things so that the Fetcher is not permitted to
> adjust scores, then this seems like a reasonable change.
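The null-safe variant discussed above could look roughly like this (a sketch only; CrawlDatum here is a minimal stand-in for illustration, not the real Nutch class):

```java
// Hedged sketch of the proposed scoring change: base the new score on
// the value currently stored in CrawlDb ('old'), falling back to 1.0f
// for newly linked pages that have no CrawlDb entry yet. CrawlDatum is
// a minimal stand-in for illustration, not the actual Nutch class.
class CrawlDatum {
    private float score;
    CrawlDatum(float score) { this.score = score; }
    float getScore() { return score; }
    void setScore(float score) { this.score = score; }
}

class ReducerSketch {
    // Mirrors the proposed CrawlDbReducer logic; 'old' may be null.
    static void updateScore(CrawlDatum old, CrawlDatum result,
                            float scoreIncrement) {
        float base = (old != null) ? old.getScore() : 1.0f;
        result.setScore(base + scoreIncrement);
    }
}
```

This is exactly the change being debated: the base comes from the current CrawlDb entry rather than from the score recorded in the segment, so a concurrent update cannot be silently overwritten.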
I would vote for implementing this change. The reason is that the active
actor that computes new scores is CrawlDb.update(). The Fetcher may
provide additional information that affects the score, but IMHO the
logic to calculate new scores should be concentrated in the update()
method.
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-------------------------------------------------------
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers