Hi,

Isn't this a critical problem?

Right now we generate segments and refetch pages, and anything in a
refetched segment will rank relatively higher, making search results
irrelevant.


So ultimately the relevant results are not returned. Is that right?


Rgds
Prabhu


On 3/9/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Andrzej Bialecki wrote:
> >> What i infer is,
> >>
> >>    1. For every refetch, the score of files (but not the directory) is
> >>    increasing
> >>
> >
> >
> > This is curious, it should not be so. However, it's the same in the
> > vanilla version of Nutch (without this patch), so we'll address this
> > separately.
>
> The OPIC algorithm is not really designed for re-fetching.  It assumes
> that each link is seen only once.  When pages are refetched, their links
> are processed again.  I think the easiest way to fix this would be to
> change ParseOutputFormat to not generate STATUS_LINKED crawldata when a
> page has been refetched.  That way scores would only be adjusted for
> links in the original version of a page.  This is not perfect, but
> considerably better than what happens now.  Incrementally updating the
> score would require re-processing the parser outputs to find outlinks
> from the previous version of the page and then subtracting their
> contribution from the page's score.  This is possible, but not easy.
>
> Doug
>
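
To make the effect described in the quoted text concrete, here is a toy
sketch in Python. It is not Nutch code: the crawl() function, the fixed
score of 1.0 handed out per fetch, and the example URLs are all
assumptions made for illustration. It only shows that re-processing the
same outlinks on every refetch keeps adding to the target scores, and
that skipping link contributions for refetched pages (roughly the
ParseOutputFormat change suggested above) keeps them stable.

```python
from collections import defaultdict


def crawl(fetches, outlinks, skip_refetch_links):
    """Toy model of OPIC-style score propagation (not Nutch's actual code).

    fetches: page URLs in fetch order; a URL may repeat (a refetch).
    outlinks: dict mapping a page URL to the list of URLs it links to.
    skip_refetch_links: if True, a refetched page contributes no link scores.
    """
    scores = defaultdict(float)
    seen = set()
    for page in fetches:
        refetched = page in seen
        seen.add(page)
        if refetched and skip_refetch_links:
            # Proposed fix: only the first version of a page adjusts scores.
            continue
        targets = outlinks.get(page, [])
        if not targets:
            continue
        # Assumption: each fetch hands out a total of 1.0, split evenly.
        share = 1.0 / len(targets)
        for target in targets:
            scores[target] += share
    return dict(scores)


# Hypothetical data: one directory page linking to two files, fetched 3 times.
outlinks = {"dir/": ["dir/a.html", "dir/b.html"]}
fetches = ["dir/", "dir/", "dir/"]

inflated = crawl(fetches, outlinks, skip_refetch_links=False)
stable = crawl(fetches, outlinks, skip_refetch_links=True)
```

In this model the files' scores grow with every refetch of the page
that links to them, while the directory page itself gains nothing,
which matches the observation quoted above. With the skip enabled the
scores stay at their first-fetch values, at the cost of ignoring links
that were added or removed in later versions of the page.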
