On 7/8/05, Rob Pettengill <[EMAIL PROTECTED]> wrote:
> 1. In CrawlTool after all the segments are fetched, there is a
> comment which reads:
>
>   // Re-fetch everything to get the complete set of incoming anchor texts
>   // associated with each page. We should fix this, so that we can update
>   // the previously fetched segments with the anchors that are now in the
>   // database, but until that algorithm is written, we re-fetch.
>
> Then there are calls to FetchListTool and Fetcher which create a new
> segment with all of the documents and refetch them, before indexing
> the new combined segment.
>
> In the internet indexing work flow described in the tutorial there
> are no equivalent steps to the refetching steps in the CrawlTool.
>
> Can someone explain what is going on here?
For each page in the page database, nutch tracks the anchor text of links to that page (that is, the text of the links on other pages that point to it). This anchor text becomes one of the fields that is eventually output into the Lucene indexes that the front-end searches. (Anchor text is useful in this way because it is often a good, short description of the page being linked to.)

The CrawlTool works by iteratively fetching a set of injected pages, then the pages they link to, then the pages those pages link to, and so on, creating a segment for each iteration. However, the earlier segments are created before the later pages are discovered, and so they don't contain the anchor text from links on those later pages that point back to pages in the earlier segments. This is because, at the end of fetching a fetch list, the page database is updated with all the new link information found during the fetch, but previously generated segments are left untouched.

The code you have seen in the CrawlTool (which simply removes all the existing segments and refetches every page in the page database as one large segment) is a hack to ensure that the final segment contains the correct anchor text for each page. It does this at the expense of effectively fetching every page twice.

Note that the problem of poor anchor text goes away once pages start to go stale and are refetched; the CrawlTool only behaves as you describe because it is not intended to be run continuously, I believe. Also, the CrawlTool has been changed slightly in the current development version of nutch so that it no longer refetches all the pages. Instead, the previously fetched segments are updated in place, which is much more efficient.
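To make the timing issue concrete, here is a toy simulation (not Nutch code; the link graph, names, and data structures are all made up for illustration) of why an early segment misses anchor text that is only discovered in a later crawl round:

```python
# Hypothetical two-page graph: seed "a" links to "b"; "b" links back to "a"
# with anchor text "home". Each round snapshots a "segment" holding the
# inbound anchors known *at fetch time*, then updates the link database.

links = {                     # page -> list of (target page, anchor text)
    "a": [("b", "next page")],
    "b": [("a", "home")],
}

def crawl(seeds, rounds):
    anchors = {}              # page -> set of inbound anchor texts (the "web db")
    segments = []             # one snapshot per crawl round
    frontier = list(seeds)
    for _ in range(rounds):
        fetched = frontier
        # The segment records only the anchors known when these pages were fetched.
        segments.append({p: set(anchors.get(p, ())) for p in fetched})
        # Afterwards, the db is updated with links found on the fetched pages;
        # earlier segments are left untouched.
        next_frontier = []
        for p in fetched:
            for target, text in links.get(p, ()):
                anchors.setdefault(target, set()).add(text)
                next_frontier.append(target)
        frontier = next_frontier
    return segments, anchors

segments, anchors = crawl(["a"], rounds=2)

# Round 1 fetched "a" before "b" was ever seen, so its segment has no anchors for "a":
print(segments[0]["a"])       # set()
# But after round 2 the db knows that "b" links to "a" with the text "home":
print(anchors["a"])           # {'home'}
# A final refetch of everything (the CrawlTool hack) picks up the full anchor data:
final_segment = {p: anchors.get(p, set()) for p in anchors}
print(final_segment["a"])     # {'home'}
```

The fix in the development version corresponds to rewriting `segments[0]` from the up-to-date `anchors` db instead of fetching the pages a second time.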
