Siddhartha Reddy wrote:
There has been some discussion about the implementation of OPIC in Nutch
vis-a-vis the proposed implementation in the original paper
<http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html>.
See these threads, for example:

http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06539.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg02915.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04319.html

I would like to know if there are any plans with regards to these
improvements.

Yes. One concrete plan is to introduce (again) a separate tool to perform offline score calculations. I should be ready with the initial code within the next week or two. I'm not sure yet how abstract the API will be, but my intention is that it should be possible to run different ranking algorithms, initially in-degree and pagerank, then perhaps OPIC if we can fix it.
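
To give a rough idea of what I mean by an abstract API - this is only a sketch of the kind of pluggable abstraction I have in mind, none of these class or method names exist in Nutch today:

    // Hypothetical sketch only - not the actual Nutch API.
    // The offline tool would iterate over the web graph and delegate the
    // per-node score computation to a pluggable algorithm.
    public interface LinkAnalysisAlgorithm {

      /** Called once before the first pass, e.g. to assign initial scores. */
      void initialize(long numNodes);

      /**
       * Compute a node's new score from its old score and from the scores and
       * out-degrees of the nodes linking to it. In-degree just counts the
       * inlinks, PageRank sums damped contributions, and an OPIC-style pass
       * would move cash along the links.
       */
      float recompute(float oldScore, float[] inlinkScores, int[] inlinkOutDegrees);

      /** Number of passes over the graph (1 for in-degree, more for PageRank). */
      int numIterations();
    }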


Below, I have summarized the main differences in the implementation, as I
understand it. Please correct me if I am wrong or missing something.

1. Nutch does not maintain two values for each node as proposed by Abiteboul
et al.: *cash* and *(credit) history*; instead it maintains only one value
that it calls *score*. The paper proposes that the *cash* be reset to 0 when
it is distributed to the outlinks and that the *history* be used as the
importance of the page; there is no parallel to this in Nutch. In fact,
OPICScoringFilter.java contains the following comment by Andrzej Bialecki:

    // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
    // XXX in the paper, where page "loses" its score if it's distributed to
    // XXX linked pages...

Right :) This is easy to fix.
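
Roughly, the fix amounts to keeping two values per page and zeroing the cash when it is distributed, as the paper describes. A toy model of one step (not actual Nutch code, and ignoring dangling pages, which point 2 below deals with):

    // Toy model of one OPIC step for a fetched page (not actual Nutch code).
    class PageScore {
      float cash;     // contributions accumulated since the last fetch
      float history;  // long-term importance, used for ranking

      /** Distribute this page's cash to its outlinks and reset it to zero. */
      void distribute(PageScore[] outlinks) {
        float share = cash / outlinks.length;  // assumes at least one outlink
        for (PageScore out : outlinks) {
          out.cash += share;                   // cash flows along the links
        }
        history += cash;                       // the page banks what it had...
        cash = 0.0f;                           // ...and "loses" it, per the paper
      }
    }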


2. Nutch has no virtual root page that is assumed to be linked to and from
every page, as the paper proposes.

... and this was difficult to implement efficiently until recently, with the upgrade to Hadoop 0.16. The virtual node plays an important role in the case of dangling nodes (pages with no outlinks): for such pages all cash goes to the virtual node and is then redistributed to all other nodes.

When running a map-reduce job it's difficult to collect side-effects (like the accumulated cash from dangling nodes) across all tasks. Hadoop Counters can do this, but before 0.16 it was not easy to read them back from a finished job :) Now it is, so we can fix this too.
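
For illustration, something along these lines - this is only a sketch against the old mapred API, and the helper class, counter name and scaling factor are all made up (counters hold longs, so the float cash has to be scaled):

    import java.io.IOException;
    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.RunningJob;

    // Hypothetical helper for accumulating dangling-node cash across all tasks.
    public class DanglingCash {
      public enum Counter { DANGLING_CASH_MICROS }
      private static final long SCALE = 1000000L;  // counters are longs, cash is a float

      /** Called from a map/reduce task when a page turns out to have no outlinks. */
      public static void report(Reporter reporter, float cash) {
        reporter.incrCounter(Counter.DANGLING_CASH_MICROS, (long) (cash * SCALE));
      }

      /** Run the job and read the accumulated value back (possible since 0.16). */
      public static float runAndReadBack(JobConf job) throws IOException {
        RunningJob running = JobClient.runJob(job);
        Counters counters = running.getCounters();
        return counters.getCounter(Counter.DANGLING_CASH_MICROS) / (float) SCALE;
      }
    }

The total read back this way could then be redistributed to all nodes in the next pass, playing the role of the paper's virtual node.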


3. The total cash in the system is not constant in Nutch as it is in the
paper; it continuously increases.

With the two fixes above it should be constant. See http://wiki.apache.org/nutch/FixingOpicScoring and the presentation at the bottom of that page.

Here is a summary of the problems this causes (again, my understanding):

1. The correctness of the page importance computation in Nutch cannot be
readily established. The OPIC paper does prove its algorithm correct, but
the algorithm implemented in Nutch differs from it in such a way that the
proof in the paper does not carry over. From the paper: "the correctness of
the algorithm requires that each page is read infinitely many times..." This
is possible even with the *Greedy* policy which Nutch uses for ordering the
crawl *if* a page's cash is reset to 0 when it is crawled (that way, every
page keeps accumulating cash until it eventually has enough to get picked
again).

2. The implementation in Nutch does not converge. Page scores increase with
every iteration; moreover, the amount by which a page's score increases
grows as the score itself grows (see the toy example after point 4 below).

3. Since the scores are also used for ranking search results, the ranking is
badly distorted by these inflated scores.

4. The page importance computed by this online process is more useful for
ordering the crawl than for ranking the search results. The algorithm is a
fix-point computation and approaches the *correct* values for the page
importance only asymptotically (i.e. at infinite time). Thus, if the score
calculation is repeated a few times *after* the crawl and *before* indexing,
the scores might be better suited for ranking the results.
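
To make point 2 concrete, here is a toy illustration (not Nutch code) of two pages that link to each other: with score accumulation the total doubles every cycle, while with the paper's cash reset the total cash stays constant at 2.0:

    // Toy illustration only: two pages A and B, each linking to the other.
    public class OpicToy {
      public static void main(String[] args) {
        // Accumulation as in point 1: a page keeps its score and also passes it on.
        float a = 1.0f, b = 1.0f;
        // Paper-style OPIC: cash is banked into history, passed on, and reset.
        float cashA = 1.0f, cashB = 1.0f, histA = 0.0f, histB = 0.0f;

        for (int i = 0; i < 5; i++) {
          // Accumulation: each page adds the other's previous score to its own.
          float na = a + b, nb = b + a;
          a = na; b = nb;

          // Paper-style: bank the cash, hand it to the other page, reset.
          histA += cashA; histB += cashB;
          float toB = cashA, toA = cashB;
          cashA = toA; cashB = toB;  // the cash just swaps sides

          System.out.printf("cycle %d: accumulated total = %.1f, total cash = %.1f%n",
              i + 1, a + b, cashA + cashB);
        }
        // Prints 4.0, 8.0, 16.0, 32.0, 64.0 for the accumulated total,
        // and 2.0 every cycle for the total cash.
      }
    }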


Re. point 4: repeating the score calculation offline defeats the point of an _online_ computation. In my understanding this is also the main weakness of OPIC: convergence takes many cycles, and if the graph is changing then some scores converge quickly while others may never converge. The authors of the OPIC paper recognize this issue; that's why they use the long history to smooth out the changes - but that only makes the convergence even slower, i.e. in a sense it hides the problem.

I am a Nutch newbie and am still trying to understand its nuances. Please
excuse me if I have overlooked anything obvious.

This is an excellent summary of the current problems with OPIC.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
