There has been some discussion about the implementation of OPIC in Nutch vis-a-vis the proposed implementation in the original paper <http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html>. See these threads, for example:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06539.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg02915.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04319.html

I would like to know if there are any plans with regard to these improvements. Below, I have summarized the main differences in the implementation, as I understand it. Please correct me if I am wrong or missing something.

1. Nutch does not maintain the two values per node proposed by Abiteboul et al., *cash* and *(credit) history*; instead it maintains only a single value that it calls *score*. The paper proposes that a page's *cash* be reset to 0 when it is distributed to the outlinks, and that the *history* be used as the importance of the page -- there is no parallel to this in Nutch. In fact, OPICScoringFilter.java contains the following comment by Andrzej Bialecki:

   // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
   // XXX in the paper, where page "loses" its score if it's distributed to
   // XXX linked pages...

2. There is no virtual root page that is assumed to be linked to every page in Nutch, as proposed by the paper.

3. The total cash in the system is not constant in Nutch as it is in the paper; it increases continuously.

Here is a summary of the problems this causes (again, my understanding):

1. The correctness of the page importance computation in Nutch cannot be readily established. The OPIC paper does prove its algorithm correct, but the algorithm implemented in Nutch differs in a way that the proof in that paper does not carry over to it. From the paper: "the correctness of the algorithm requires that each page is read infinitely many times..." This condition can be met even with the *Greedy* policy that Nutch uses for ordering the crawl, *if* a page's cash is reset to 0 when it is crawled (that way, every page keeps accumulating cash until it eventually has enough to get picked).

2. The implementation in Nutch does not converge. The scores of the pages increase during each iteration; moreover, the amount by which a page's score increases grows as the page's score itself grows.

3. Since the scores are also used for ranking search results, the ranking is badly affected by the inflation of the scores.

4. The page importance computed by this online process is more useful for ordering the crawl than for ranking the search results. The algorithm is a fixpoint computation and will reach the *correct* values for the page importance only asymptotically (i.e. at infinite time). Thus, if the score calculation is repeated a few times *after* the crawl and *before* indexing, the scores might be better suited for ranking the results.

I am a Nutch newbie and am still trying to understand its nuances. Please excuse me if I have overlooked anything obvious.

Thanks,
Siddhartha Reddy