There has been some discussion about the implementation of OPIC in Nutch
vis-a-vis the implementation proposed in the original paper
<http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html>.
See these threads, for example:

http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06539.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg02915.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04319.html

I would like to know if there are any plans regarding these improvements.


Below, I have summarized the main differences between the two implementations,
as I understand them. Please correct me if I am wrong or missing something.

1. Nutch does not maintain the two values per node proposed by Abiteboul
et al.: *cash* and *(credit) history*; instead it maintains a single value
that it calls *score*. The paper proposes that a page's *cash* be reset to 0
once it is distributed to the outlinks and that the *history* be used as the
importance of the page -- there is no parallel to this in Nutch (see the
sketch after this list). In fact, OPICScoringFilter.java contains the
following comment by Andrzej Bialecki:

    // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
    // XXX in the paper, where page "loses" its score if it's distributed to
    // XXX linked pages...

2. Nutch has no equivalent of the virtual root page that the paper assumes is
linked to and from every page.

3. The total cash in the system is not constant in Nutch as it is in the
paper; it increases continuously.
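
To make sure we are talking about the same thing, here is how I read the
per-page step from the paper, written up as a toy Java sketch. This is purely
illustrative code of my own (nothing here exists in Nutch, and all class,
method and page names are made up); I have also left out the step where the
virtual page's own cash is redistributed over all pages when it is processed.

    import java.util.*;

    // A toy model of the algorithm as described in the paper (my own code,
    // not Nutch's). All names here are made up for illustration.
    public class OpicSketch {

        static final String VIRTUAL = "_virtual_";             // the paper's virtual root page

        Map<String, Double> cash = new HashMap<>();             // current cash per page
        Map<String, Double> history = new HashMap<>();          // accumulated (credit) history
        Map<String, List<String>> outlinks = new HashMap<>();   // link graph discovered so far

        // When a page is fetched: bank its cash in its history, split the cash
        // among its outlinks plus the virtual page, and reset the page's cash
        // to 0. The total cash in the system never changes.
        void fetch(String page) {
            double c = cash.getOrDefault(page, 0.0);
            cash.put(page, 0.0);                                // the page "loses" its cash
            history.merge(page, c, Double::sum);                // history is the importance estimate

            List<String> targets = new ArrayList<>(outlinks.getOrDefault(page, List.of()));
            targets.add(VIRTUAL);                               // every page also links to the virtual page
            double share = c / targets.size();
            for (String t : targets) {
                cash.merge(t, share, Double::sum);
            }
        }

        // Greedy crawl ordering: fetch the page that currently holds the most
        // cash. Because cash is reset to 0 on fetch, every page keeps
        // accumulating cash until it is eventually selected again.
        String nextToFetch() {
            return cash.entrySet().stream()
                       .filter(e -> !e.getKey().equals(VIRTUAL))
                       .max(Map.Entry.comparingByValue())
                       .map(Map.Entry::getKey)
                       .orElse(null);
        }

        // Importance estimate: a page's share of the total history so far.
        double importance(String page) {
            double total = history.values().stream().mapToDouble(Double::doubleValue).sum();
            return total == 0.0 ? 0.0 : history.getOrDefault(page, 0.0) / total;
        }
    }

The point I want to highlight is that with this bookkeeping the total cash in
the system stays constant and the relative history values settle down, whereas
a single ever-growing score does neither.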


Here is a summary of the problems this causes (again, my understanding):

1. The correctness of the page importance computation in Nutch cannot be
readily established. The OPIC paper does prove its algorithm correct, but the
algorithm implemented in Nutch differs in ways that make that proof
inapplicable. From the paper: "the correctness of the algorithm requires that
each page is read infinitely many times..." This condition can be satisfied
even with the *Greedy* policy that Nutch uses for ordering the crawl, *if* a
page's cash is reset to 0 when it is crawled (that way, every page keeps
accumulating cash until it eventually has enough to get picked -- the Greedy
selection in the sketch above relies on exactly this.)

2. The implementation in Nutch does not converge. The scores of the pages
increase with every iteration; moreover, the amount by which a page's score
increases grows as the score itself grows, since inlinking pages keep
contributing shares of their own ever-growing scores and nothing is ever
subtracted.

3. Since the scores are also used for ranking search results, the ranking is
adversely affected by these inflated scores.

4. The page importance computed by this online process is more useful for
ordering the crawl than for ranking search results. The algorithm is a
fixpoint computation and reaches the *correct* page importance values only
asymptotically (i.e. at infinite time). Thus, if the score calculation were
repeated a few more times *after* the crawl and *before* indexing, the scores
might be better suited for ranking the results (see the sketch after this
list).
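
To make point 4 concrete, what I have in mind is to re-run the same
distribute-and-reset step over the already-crawled link graph (without
fetching anything) for a few extra rounds before indexing. Again, this is
only a sketch under my own assumptions -- nothing like it exists in Nutch,
and all the names are made up:

    import java.util.*;

    // Hypothetical post-crawl refinement: iterate the OPIC-style distribution
    // over the crawled link graph until the relative scores settle, then use
    // the accumulated history for ranking. Assumes every outlink target is
    // itself a key of the map (i.e. a crawled page).
    public class OfflineOpicSketch {

        public static Map<String, Double> refine(Map<String, List<String>> outlinks, int rounds) {
            Set<String> pages = outlinks.keySet();
            int n = pages.size();

            Map<String, Double> cash = new HashMap<>();
            Map<String, Double> history = new HashMap<>();
            for (String p : pages) cash.put(p, 1.0 / n);   // total cash starts (and stays) at 1

            for (int r = 0; r < rounds; r++) {
                for (String p : pages) {
                    double c = cash.get(p);
                    cash.put(p, 0.0);                      // reset, as in the paper
                    history.merge(p, c, Double::sum);

                    List<String> links = outlinks.get(p);
                    // A dangling page gives its cash back to everyone, as a
                    // crude stand-in for the paper's virtual page.
                    Collection<String> targets = links.isEmpty() ? pages : links;
                    double share = c / targets.size();
                    for (String t : targets) cash.merge(t, share, Double::sum);
                }
            }

            // Normalise so the scores sum to 1; a page's share of the total
            // history is its importance estimate.
            double total = history.values().stream().mapToDouble(Double::doubleValue).sum();
            Map<String, Double> scores = new HashMap<>();
            for (String p : pages) scores.put(p, history.getOrDefault(p, 0.0) / total);
            return scores;
        }
    }

Running something like refine(graph, 10) over the final link graph before
indexing is the kind of post-processing I mean; I have no idea yet how many
rounds would be needed in practice.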


I am a Nutch newbie and am still trying to understand its nuances. Please
excuse me if I have overlooked anything obvious.

Thanks,
Siddhartha Reddy
