[ https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120982#comment-13120982 ]
Andrzej Bialecki commented on NUTCH-1124:
------------------------------------------
Our implementation is most definitely inaccurate (broken?), though I'm not sure
if the original OPIC algorithm is better.

The original OPIC paper explains that each node needs to give away all its
cash and then receive cash from other nodes. In their experiments, however,
this led to a yo-yo instability, with large amounts of cash floating in and
out in response both to changes in the graph and to the delay of a full
re-crawl cycle: all known URLs need to be re-crawled in order to collect and
redistribute all the cash that is potentially floating in the graph. To dampen
this effect they added buffering: they kept a history of the latest N scores
and used the average of these scores. This smoothed and dampened the changes,
but it's an artificial hack that is sensitive to the dynamics of changes in
the webgraph and to the speed of re-crawl.
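
To make the buffering idea concrete, here is a minimal sketch in Python. It is
not code from the OPIC paper or from Nutch; the class name SmoothedScore and
the window parameter are made up for illustration, and the only thing it shows
is "keep the latest N scores and report their average".

```python
from collections import deque


class SmoothedScore:
    """Keeps the latest `window` raw scores and reports their average.

    Hypothetical helper for illustration only, not part of Nutch or the paper.
    """

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque(maxlen=...) silently drops the oldest score once it is full.
        self._history = deque(maxlen=window)

    def record(self, raw_score: float) -> float:
        """Store a new raw score and return the average over the history."""
        self._history.append(raw_score)
        return sum(self._history) / len(self._history)


# A raw score that swings between 10 and 0 on every cycle (the yo-yo case).
smoother = SmoothedScore(window=2)
smoothed = [smoother.record(raw) for raw in (10.0, 0.0, 10.0, 0.0)]
print(smoothed)  # [10.0, 5.0, 5.0, 5.0]
```

The swing disappears once the window is full, but how well that works depends
on the window size relative to how fast the graph changes and how often nodes
are re-crawled, which is the sensitivity mentioned above.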

Our implementation of OPIC doesn't give away cash at all. Instead, it
duplicates the cash and then distributes it, which causes the total amount of
cash floating in the webgraph to double in each cycle, even when the graph is
static. We could fix this by giving away all cash and then introducing a
mechanism to collect all cash from dangling nodes (those without outlinks) and
redistribute it evenly to all nodes. This would bring us closer to the
original OPIC without smoothing. Still, I expect the same instability would
occur, especially in the face of a changing graph.
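
The difference between the two schemes can be sketched on a toy graph. This is
a simplified model in Python, not the Nutch scoring-opic code: the function
names, the graph layout, and the even split of cash across outlinks are all
assumptions made for the example.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # node -> list of outlink targets


def step_duplicating(graph: Graph, cash: Dict[str, float]) -> Dict[str, float]:
    """Each node keeps its cash AND sends a copy, split evenly, to its outlinks."""
    new_cash = dict(cash)  # nothing is given away
    for node, outlinks in graph.items():
        if outlinks:
            share = cash[node] / len(outlinks)
            for target in outlinks:
                new_cash[target] += share
    return new_cash


def step_conserving(graph: Graph, cash: Dict[str, float]) -> Dict[str, float]:
    """Each node gives away all its cash; dangling cash is spread over all nodes."""
    new_cash = {node: 0.0 for node in graph}
    dangling_pool = 0.0
    for node, outlinks in graph.items():
        if outlinks:
            share = cash[node] / len(outlinks)
            for target in outlinks:
                new_cash[target] += share
        else:
            # No outlinks: collect the cash instead of losing it.
            dangling_pool += cash[node]
    bonus = dangling_pool / len(graph)
    for node in new_cash:
        new_cash[node] += bonus
    return new_cash


# Static graph where every node has outlinks: duplicated cash doubles per cycle.
cyclic: Graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
cash = {node: 1.0 for node in cyclic}
for _ in range(3):
    cash = step_duplicating(cyclic, cash)
print(sum(cash.values()))  # 24.0, i.e. 3 * 2**3

# Graph with a dangling node "d": total cash stays constant.
dangling: Graph = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": []}
cash = {node: 1.0 for node in dangling}
for _ in range(3):
    cash = step_conserving(dangling, cash)
print(abs(sum(cash.values()) - 4.0) < 1e-9)  # True
```

The conserving variant only fixes the growth of the total. Individual node
scores can still swing from cycle to cycle, which is the instability I would
still expect to see.
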
> JUnit test for scoring-opic
> ---------------------------
>
> Key: NUTCH-1124
> URL: https://issues.apache.org/jira/browse/NUTCH-1124
> Project: Nutch
> Issue Type: Sub-task
> Components: build
> Affects Versions: 1.4
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.5
>
>
> This issue is part of the larger attempt to provide a JUnit test case for
> every Nutch plugin.