Hi,

Is this not a critical problem?
Right now we generate segments and refetch pages, and any refetched segment will rank relatively higher, making search results irrelevant. So ultimately the relevant results are not returned. Is that right?

Rgds
Prabhu

On 3/9/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Andrzej Bialecki wrote:
> >> What i infer is,
> >>
> >> 1. For every refetch, the score of files (but not the directory) is
> >> increasing
> >
> > This is curious, it should not be so. However, it's the same in the
> > vanilla version of Nutch (without this patch), so we'll address this
> > separately.
>
> The OPIC algorithm is not really designed for re-fetching. It assumes
> that each link is seen only once. When pages are refetched, their links
> are processed again. I think the easiest way to fix this would be to
> change ParseOutputFormat to not generate STATUS_LINKED crawldata when a
> page has been refetched. That way scores would only be adjusted for
> links in the original version of a page. This is not perfect, but
> considerably better than what happens now. Incrementally updating the
> score would require re-processing the parser outputs to find outlinks
> from the previous version of the page and then subtracting their
> contribution from the page's score. This is possible, but not easy.
>
> Doug
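
To make the effect described in the quoted thread concrete, here is a rough toy sketch in Python. It is not Nutch code: the in-memory `scores`, `outlinks` and `seen` structures and the `apply_fetch` and `simulate` helpers are hypothetical stand-ins, and the scoring is only loosely OPIC-like (a page hands an equal share of its score to each outlink every time its links are processed). It compares reprocessing links on every refetch with the suggested fix of emitting no link contributions for a refetched page.

```python
from collections import defaultdict


def apply_fetch(scores, outlinks, page, seen, skip_refetched_links):
    """Pass an equal share of `page`'s score to each of its outlinks.

    Toy model only: these dicts and sets are hypothetical stand-ins,
    not Nutch's CrawlDb or ParseOutputFormat.
    """
    refetch = page in seen
    seen.add(page)
    if refetch and skip_refetched_links:
        # Suggested fix: a refetched page contributes no link scores.
        return
    targets = outlinks.get(page, [])
    if not targets:
        return
    share = scores[page] / len(targets)
    for target in targets:
        scores[target] += share


def simulate(rounds, skip_refetched_links):
    """Fetch the same directory page `rounds` times and return the scores."""
    outlinks = {"dir/": ["dir/a.html", "dir/b.html"]}
    scores = defaultdict(float, {"dir/": 1.0})
    seen = set()
    for _ in range(rounds):
        apply_fetch(scores, outlinks, "dir/", seen, skip_refetched_links)
    return dict(scores)


if __name__ == "__main__":
    print("links reprocessed on refetch:", simulate(3, False))
    print("links skipped on refetch:   ", simulate(3, True))
```

In this toy model the linked files gain score on every refetch of the directory page while the directory's own score stays put, which matches the symptom reported in the thread. With the skip enabled, the files keep only the contribution from the first fetch.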