On 7/8/05, Rob Pettengill <[EMAIL PROTECTED]> wrote:
> 1. In CrawlTool after all the segments are fetched, there is a
> comment which reads:
>
>   // Re-fetch everything to get the complete set of incoming anchor texts
>   // associated with each page. We should fix this, so that we can update
>   // the previously fetched segments with the anchors that are now in the
>   // database, but until that algorithm is written, we re-fetch.
>
> Then there are calls to FetchListTool and Fetcher which create a new
> segment with all of the documents and refetch them, before indexing
> the new combined segment.
>
> In the internet indexing work flow described in the tutorial there
> are no equivalent steps to the refetching steps in the CrawlTool.
>
> Can someone explain what is going on here?
For each page in the page database, nutch tracks the anchor text of links to that page (that is, the text of the links on other pages that point to it). This anchor text becomes one of the fields that is eventually output into the Lucene indexes that the front-end searches. (Anchor text is useful in this way because it is often a good, short description of the page being linked to.)

The CrawlTool works by iteratively fetching a set of injected pages, then the pages they link to, then the pages those pages link to, and so on, creating a segment for each iteration. However, the earlier segments are created before the later pages are discovered, and so they don't contain the anchor text from links on those later pages that point back to pages in the earlier segments. This is because, at the end of fetching a fetch list, the page database is updated with all the new link information found during the fetch, but previously generated segments are left untouched.

The code you have seen in the CrawlTool (which simply removes all the existing segments and refetches every page in the page database as one large segment) is a hack to ensure that the final segment contains the correct anchor text for each page. It does this at the expense of effectively fetching every page twice.

Note that the problem of poor anchor text goes away once pages start to go stale and are refetched; the CrawlTool only behaves as you describe because it is not intended to be run continuously, I believe. Also, the CrawlTool has been changed slightly in the current development version of nutch so that it no longer refetches all the pages. Instead, the previously fetched segments are updated in place, which is much more efficient.
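To make the timing issue concrete, here is a toy simulation (not Nutch code; the link graph, names, and data structures are all made up for illustration) of why an early segment misses anchor text that is only discovered in a later crawl round:

```python
# Hypothetical two-page graph: seed "a" links to "b"; "b" links back to "a"
# with anchor text "home". Each round snapshots a "segment" holding the
# inbound anchors known *at fetch time*, then updates the link database.

links = {                     # page -> list of (target page, anchor text)
    "a": [("b", "next page")],
    "b": [("a", "home")],
}

def crawl(seeds, rounds):
    anchors = {}              # page -> set of inbound anchor texts (the "web db")
    segments = []             # one snapshot per crawl round
    frontier = list(seeds)
    for _ in range(rounds):
        fetched = frontier
        # The segment records only the anchors known when these pages were fetched.
        segments.append({p: set(anchors.get(p, ())) for p in fetched})
        # Afterwards, the db is updated with links found on the fetched pages;
        # earlier segments are left untouched.
        next_frontier = []
        for p in fetched:
            for target, text in links.get(p, ()):
                anchors.setdefault(target, set()).add(text)
                next_frontier.append(target)
        frontier = next_frontier
    return segments, anchors

segments, anchors = crawl(["a"], rounds=2)

# Round 1 fetched "a" before "b" was ever seen, so its segment has no anchors for "a":
print(segments[0]["a"])       # set()
# But after round 2 the db knows that "b" links to "a" with the text "home":
print(anchors["a"])           # {'home'}
# A final refetch of everything (the CrawlTool hack) picks up the full anchor data:
final_segment = {p: anchors.get(p, set()) for p in anchors}
print(final_segment["a"])     # {'home'}
```

The fix in the development version corresponds to rewriting `segments[0]` from the up-to-date `anchors` db instead of fetching the pages a second time.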
