I am crawling files for a vertical search engine, a case that
resembles both the intranet search with CrawlTool and the
internet search workflow in the tutorial.
I have two sets of questions that may be related:
1. In CrawlTool after all the segments are fetched, there is a
comment which reads:
// Re-fetch everything to get the complete set of incoming anchor texts
// associated with each page. We should fix this, so that we can update
// the previously fetched segments with the anchors that are now in the
// database, but until that algorithm is written, we re-fetch.
Then there are calls to FetchListTool and Fetcher, which create a new
segment containing all of the documents and refetch them before
indexing the new combined segment.
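To check my reading of that comment, here is a toy sketch in plain Java. It is not Nutch code: the class AnchorSketch, the Link type, and the example.com URLs are all made up for illustration. It only shows that the set of incoming anchor texts for a page keeps growing as later rounds are fetched, which I assume is why segments fetched early carry incomplete anchors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AnchorSketch {
    /** Hypothetical link record: target URL plus the anchor text pointing at it. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    /** Inverts a list of outlinks into a map of target URL -> incoming anchor texts. */
    static Map<String, List<String>> incomingAnchors(List<Link> links) {
        Map<String, List<String>> anchors = new TreeMap<>();
        for (Link link : links) {
            anchors.computeIfAbsent(link.toUrl, k -> new ArrayList<>()).add(link.anchor);
        }
        return anchors;
    }

    public static void main(String[] args) {
        // Round 1 fetches the seed page, which links to pages A and B.
        List<Link> round1 = List.of(
                new Link("http://example.com/a", "page A"),
                new Link("http://example.com/b", "page B"));
        // Round 2 fetches A and B; B turns out to link to A as well.
        List<Link> round2 = List.of(
                new Link("http://example.com/a", "see also A"));

        // Anchors known at the moment A is fetched (links from round 1 only).
        System.out.println(incomingAnchors(round1));

        // Anchors known once every round has finished.
        List<Link> all = new ArrayList<>(round1);
        all.addAll(round2);
        System.out.println(incomingAnchors(all));
    }
}
```

If that reading is right, the second map is what a final refetch would see, while the first is all that was available when A's segment was written.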
In the internet indexing workflow described in the tutorial there
are no steps equivalent to the refetching steps in CrawlTool.
Can someone explain what is going on here?
====
2. I ended up with a half dozen segments, which I combined into one
new segment using the SegmentMergeTool. Inspired by the workflow
differences described above, as an experiment I built a new page and
link db via a workflow like this:
nutch admin db2 create
nutch updatedb db2 <combined segment>
I then compared the original db with db2 using WebDBReader with the
-stats option. What I found was that db2 had only about half as many
links as the original db.
Where did the links go? Doesn't the combined segment have all of the
original content?
Do these results correctly imply that a refetch is necessary any time
segments are combined? If so, why?
====
P.S.
As I've been sorting out Nutch, I found myself constantly going back
and forth between the Javadocs, the Java source code, the nutch
script code, and the tutorial. I ended up putting together a
document that summarizes the key information extracted from all of
the above for commands that can be issued through the nutch script.
It can be found at:
http://mesavida.com/NutchSubCmds.html
Do you all think that this kind of Nutch subcommand reference is
useful? If so, where should I put it?
Thanks,
Rob
--
Rob Pettengill,
Questions about petroleum?
Goto: http://AskAboutOil.com/