Andrzej Bialecki wrote:
>> What I infer is,
>> 1. For every refetch, the score of files (but not the directory) is
>> increasing
>
> This is curious, it should not be so. However, it's the same in the
> vanilla version of Nutch (without this patch), so we'll address this
> separately.

The OPIC algorithm is not really designed for re-fetching. It assumes
that each link is seen only once. When pages are refetched, their links
are processed again. I think the easiest way to fix this would be to
change ParseOutputFormat to not generate STATUS_LINKED crawldata when a
page has been refetched. That way scores would only be adjusted for
links in the original version of a page. This is not perfect, but
considerably better than what happens now. Incrementally updating the
score would require re-processing the parser outputs to find outlinks
from the previous version of the page and then subtracting their
contribution from the page's score. This is possible, but not easy.
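
To illustrate the effect, here is a rough toy model in Java. It is not Nutch code: the class OpicRefetchSketch, the crawl method and the skipLinksOnRefetch flag are all made up for this example, and the scoring is simplified so that each fetch hands out a fixed contribution of 1.0 split evenly among the page's outlinks. It only shows why re-processing links inflates the scores of linked files while the directory itself stays unchanged, and how skipping link processing on a refetch stops that.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the double-counting problem; not the Nutch API.
public class OpicRefetchSketch {

    // Simplification: every fetch distributes 1.0 evenly over the outlinks.
    static Map<String, Double> crawl(Map<String, List<String>> outlinks,
                                     List<String> fetchOrder,
                                     boolean skipLinksOnRefetch) {
        Map<String, Double> score = new HashMap<>();
        Set<String> fetched = new HashSet<>();
        for (String page : fetchOrder) {
            // add() returns false if the page was already fetched before
            boolean refetch = !fetched.add(page);
            if (refetch && skipLinksOnRefetch) {
                continue; // proposed fix: ignore links of a refetched page
            }
            List<String> links = outlinks.getOrDefault(page, List.of());
            if (links.isEmpty()) {
                continue;
            }
            double share = 1.0 / links.size();
            for (String target : links) {
                score.merge(target, share, Double::sum);
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = Map.of("dir", List.of("a", "b"));
        List<String> order = List.of("dir", "dir", "dir"); // one fetch, two refetches
        System.out.println("links re-processed: " + crawl(outlinks, order, false));
        System.out.println("links skipped:      " + crawl(outlinks, order, true));
    }
}
```

With links re-processed, "a" and "b" each end up at 1.5 after three fetches of "dir"; with refetches skipped they stay at 0.5. The directory never receives a score in either case because nothing links to it.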
Doug
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general