Andrzej Bialecki wrote:
What I infer is:

   1. For every refetch, the score of files (but not the directory) is
   increasing


This is curious; it should not be so. However, the same thing happens in the vanilla version of Nutch (without this patch), so we'll address this separately.

The OPIC algorithm is not really designed for re-fetching. It assumes that each link is seen only once. When pages are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links in the original version of a page. This is not perfect, but considerably better than what happens now. Incrementally updating the score would require re-processing the parser outputs to find outlinks from the previous version of the page and then subtracting their contribution from the page's score. This is possible, but not easy.
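
To illustrate the effect described above, here is a toy model in Python. It is only a sketch and does not use Nutch's code or API: the function name simulate, the flag skip_linked_on_refetch, and the example URLs are all hypothetical. It assumes a directory page with a fixed score that splits its score equally among its outlinks every time its links are processed, which is a simplification of OPIC.

```python
def simulate(fetches, skip_linked_on_refetch):
    """Toy model of score updates across repeated fetches of one page.

    Hypothetical simplification, not Nutch code: a directory page with a
    fixed score of 1.0 links to two files, and every time its outlinks are
    processed each file receives an equal share of that score.
    Returns the score of one file after each fetch.
    """
    outlinks = {"dir/": ["dir/a.txt", "dir/b.txt"]}
    dir_score = 1.0
    scores = {"dir/a.txt": 0.0, "dir/b.txt": 0.0}
    history = []

    for n in range(fetches):
        refetched = n > 0
        # Proposed fix (modelled loosely): emit no link contributions
        # when the page is a refetch, so links are counted only once.
        if not (refetched and skip_linked_on_refetch):
            share = dir_score / len(outlinks["dir/"])
            for target in outlinks["dir/"]:
                scores[target] += share
        history.append(scores["dir/a.txt"])

    return history


if __name__ == "__main__":
    # Current behaviour: the file's score grows with every refetch.
    print(simulate(3, skip_linked_on_refetch=False))
    # With the proposed change: the score stays at its first value.
    print(simulate(3, skip_linked_on_refetch=True))
```

In this model the directory's own score never changes because nothing links to it, while the files gain another share on each refetch unless link contributions are suppressed for refetched pages.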

Doug


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
