I am crawling files for a vertical search engine, a case that has similarities to both the intranet search with CrawlTool and the internet search workflow in the tutorial.

I have two sets of questions that may be related:

1. In CrawlTool after all the segments are fetched, there is a comment which reads:

        // Re-fetch everything to get the complete set of incoming anchor texts
        // associated with each page. We should fix this, so that we can update
        // the previously fetched segments with the anchors that are now in the
        // database, but until that algorithm is written, we re-fetch.

Then there are calls to FetchListTool and Fetcher, which create a new segment containing all of the documents and refetch them, before the new combined segment is indexed.
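To make sure I am reading the comment correctly, here is how I picture the problem, as a toy model in Python. This is only a sketch of my understanding, not Nutch code: the page names, the `segments` structure, and the `crawl_incrementally` function are all made up for illustration, and I am assuming that a segment only carries the anchors that were known when its pages were fetched.

```python
# Toy model, NOT Nutch code: every name here is hypothetical.
# Each "segment" maps a fetched page to its outlinks as (target, anchor text) pairs.
segments = [
    {"a.html": [("b.html", "see page B")]},
    {"b.html": [("c.html", "on to C")]},
    {"c.html": [("a.html", "back to A"), ("b.html", "B again")]},
]


def crawl_incrementally(segments):
    """Fetch segments in order, recording the anchors known at fetch time."""
    link_db = {}           # target page -> all incoming anchor texts seen so far
    anchors_at_fetch = {}  # page -> snapshot of its incoming anchors when fetched
    for segment in segments:
        # Assumption: a page's anchors are frozen into its segment at fetch time.
        for page in segment:
            anchors_at_fetch[page] = list(link_db.get(page, []))
        # Only afterwards does the db learn about this segment's outlinks.
        for page, outlinks in segment.items():
            for target, anchor in outlinks:
                link_db.setdefault(target, []).append(anchor)
    return link_db, anchors_at_fetch


link_db, anchors_at_fetch = crawl_incrementally(segments)

# Compare what each segment captured with what the db knows at the end.
for page in sorted(anchors_at_fetch):
    print(page, anchors_at_fetch[page], "vs", link_db.get(page, []))
```

In this model the first segment never sees the anchor that the last segment contributes for a.html, so only a second pass over everything would pick it up. If that is the reason for the refetch, I would expect the same issue to apply to the tutorial workflow.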

In the internet indexing workflow described in the tutorial, there are no steps equivalent to the refetching steps in CrawlTool.

Can someone explain what is going on here?

====

2. I ended up with a half dozen segments, which I combined into one new segment using the SegmentMergeTool. Inspired by the workflow differences described above, as an experiment I built a new page and link db via a workflow like this:
nutch admin db2 create
nutch updatedb db2 combined segment.

I then compared the original db with db2 using WebDBReader with the -stats option. What I found was that db2 had only about half as many links as the original db.

Where did the links go? Doesn't the combined segment have all of the original content?

Do these results correctly imply that a refetch is necessary any time segments are combined? If so why?

====
P.S.

As I've been sorting out Nutch, I found myself constantly going back and forth between the Javadocs, the Java source code, the nutch script code, and the tutorial. I ended up putting together a document that summarizes the key information extracted from all of the above for the commands that can be issued through the nutch script. It can be found at:

http://mesavida.com/NutchSubCmds.html

Do you all think that this kind of nutch subcommand reference is useful? If so where should I put it?

Thanks,
Rob


--
Rob Pettengill,

Questions about petroleum?
    Goto:   http://AskAboutOil.com/
