Thanks for the cogent explanation!
I'm guessing that adding the link info to the index of earlier pages
from pages fetched later in other segments is awkward because they
are not in the index being currently built.
This still leaves me a bit puzzled as to how I could merge segments,
rebuild the page and link database from the merged segment, and end
up with roughly half the number of links in the new db. It seems
that there should be no change in the number of links. I'm beginning
to suspect pilot error of some kind on my part.
Thanks,
Rob
On 2005, Jul 12, at 8:05 AM, Russell Mayor wrote:
On 7/8/05, Rob Pettengill <[EMAIL PROTECTED]> wrote:
1. In CrawlTool after all the segments are fetched, there is a
comment which reads:
// Re-fetch everything to get the complete set of incoming anchor texts
// associated with each page. We should fix this, so that we can update
// the previously fetched segments with the anchors that are now in the
// database, but until that algorithm is written, we re-fetch.
Then there are calls to FetchListTool and Fetcher which create a new
segment with all of the documents and refetch them, before indexing
the new combined segment.
In the internet indexing workflow described in the tutorial there
are no steps equivalent to the refetching steps in the CrawlTool.
Can someone explain what is going on here?
For each page in the page database nutch tracks the anchor text for
links to that page (that is, the text of the links on other pages that
point to that page). This anchor text forms one of the fields that is
eventually output into the Lucene indexes that the front-end searches.
(Anchor text is useful in this way because it is often a good, short
description of the page being linked to).
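As a rough illustration of that bookkeeping, here is a minimal in-memory sketch of anchor-text accumulation per target page. The class and method names (AnchorDb, addAnchor, anchorsFor) are hypothetical, not Nutch's actual API, and the real page database lives on disk rather than in a HashMap:

```java
import java.util.*;

// Hypothetical sketch: collect the anchor text of every link pointing
// at a given URL. The aggregated list is the kind of data that ends up
// in the anchor field of the Lucene document for that page.
public class AnchorDb {
    private final Map<String, List<String>> anchorsByUrl = new HashMap<>();

    // Record that some page links to targetUrl with the given anchor text.
    public void addAnchor(String targetUrl, String anchorText) {
        anchorsByUrl.computeIfAbsent(targetUrl, k -> new ArrayList<>())
                    .add(anchorText);
    }

    // All anchor text collected so far for a page.
    public List<String> anchorsFor(String targetUrl) {
        return anchorsByUrl.getOrDefault(targetUrl, Collections.emptyList());
    }
}
```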
The CrawlTool works by iteratively fetching a set of injected pages,
then the pages they link to, then the pages those link to, and so on,
creating a segment for each iteration. However, the earlier segments
are created before the later pages are discovered, and so don't contain
the anchor text for all the links on those later pages that link to
pages in the earlier segments. This is because, at the end of fetching
a fetch list, the page database is updated with all the new link
information found during the fetch, but previously generated segments
are left untouched.
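The iteration can be sketched as a simple frontier-expansion loop. This is a toy model, not the CrawlTool's actual code: the link graph map stands in for real fetching, and the names are made up. The point it demonstrates is that a segment written in round i is never revisited, so links discovered in later rounds can't contribute anchor text to it:

```java
import java.util.*;

// Hypothetical sketch of the CrawlTool's iterations: each round fetches
// the current frontier into a new "segment", then expands the frontier
// with newly discovered outlinks.
public class CrawlLoopSketch {
    // linkGraph: page -> pages it links to (a stand-in for real fetching).
    static List<List<String>> crawl(Map<String, List<String>> linkGraph,
                                    Set<String> injected, int depth) {
        List<List<String>> segments = new ArrayList<>();
        Set<String> fetched = new HashSet<>();
        Set<String> frontier = new TreeSet<>(injected);
        for (int i = 0; i < depth && !frontier.isEmpty(); i++) {
            // Segment i is written once from the current frontier and
            // never touched again.
            segments.add(new ArrayList<>(frontier));
            fetched.addAll(frontier);
            // "Fetching" discovers outlinks; their anchor text would
            // update the page database, but segments 0..i are already
            // on disk.
            Set<String> next = new TreeSet<>();
            for (String url : frontier) {
                for (String out : linkGraph.getOrDefault(url, List.<String>of())) {
                    if (!fetched.contains(out)) next.add(out);
                }
            }
            frontier = next;
        }
        return segments;
    }
}
```

With a graph where b links back to a, the b-to-a anchor is only discovered in round 1, after the segment containing a was written in round 0.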
The code you have seen in the CrawlTool (which just removes all the
existing segments and refetches all the pages in the entire page
database as one large segment) is a hack so that the final segment
contains the correct anchor text data for each page. It does this at
the expense of effectively fetching every page twice.
Note that the problem of poor anchor text goes away once pages start
to go stale and are refetched - the CrawlTool only behaves as you
have described because it is not intended to be used continuously, I
believe.
Also, the CrawlTool has been changed slightly in the current
development version of nutch so that it does not refetch all the pages
again. Instead, the previously fetched segments are updated in place.
This is much more efficient.
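The update-in-place idea can be sketched like this. Again the names (SegmentUpdater, updateAnchors) are hypothetical and the maps stand in for on-disk structures; it just shows that a stale segment's anchor lists can be replaced with the page database's current view without fetching any page content again:

```java
import java.util.*;

// Hypothetical sketch of "update in place": walk an already-written
// segment and merge in anchor text the page database has learned since
// the segment was created.
public class SegmentUpdater {
    // segment: url -> anchors stored when the segment was written
    // db:      url -> complete anchors known now
    static Map<String, List<String>> updateAnchors(
            Map<String, List<String>> segment,
            Map<String, List<String>> db) {
        Map<String, List<String>> updated = new HashMap<>();
        for (Map.Entry<String, List<String>> e : segment.entrySet()) {
            // Take the database's current anchor list if it has one,
            // otherwise keep what the segment already stored.
            updated.put(e.getKey(),
                        db.getOrDefault(e.getKey(), e.getValue()));
        }
        return updated;
    }
}
```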