Hi Rob,

That's a useful Nutch command reference you have there. Could you please put it on Nutch's Wiki? http://wiki.apache.org/nutch/
Maybe somewhere on, or linked from, the http://wiki.apache.org/nutch/CommandLineOptions page.

Otis

--- Rob Pettengill <[EMAIL PROTECTED]> wrote:

> I am crawling files for a vertical search engine - a case which has
> similarities both to the intranet search with CrawlTool and the
> internet search workflow in the tutorial.
>
> I have two sets of questions that may be related:
>
> 1. In CrawlTool after all the segments are fetched, there is a
> comment which reads:
>
>     // Re-fetch everything to get the complete set of incoming anchor texts
>     // associated with each page. We should fix this, so that we can update
>     // the previously fetched segments with the anchors that are now in the
>     // database, but until that algorithm is written, we re-fetch.
>
> Then there are calls to FetchListTool and Fetcher which create a new
> segment with all of the documents and refetch them, before indexing
> the new combined segment.
>
> In the internet indexing workflow described in the tutorial there
> are no equivalent steps to the refetching steps in the CrawlTool.
>
> Can someone explain what is going on here?
>
> ====
>
> 2. I ended up with a half dozen segments which I combined into one
> new segment using the SegmentMergeTool. Inspired by the workflow
> differences described above, as an experiment
> I built a new page and link db via a workflow like this:
>     nutch admin db2 create
>     nutch updatedb db2 combined segment.
>
> I then compared the original db with db2 using WebDBReader with the
> -stats option. What I found was that db2 only had about half as many
> links as the original db.
>
> Where did the links go? Doesn't the combined segment have all of the
> original content?
>
> Do these results correctly imply that a refetch is necessary any time
> segments are combined? If so, why?
>
> ====
> P.S.
>
> As I've been sorting out Nutch, I found myself constantly going back
> and forth between the Javadocs, the Java source code, the nutch
> script code, and the tutorial. I ended up putting together a
> document that summarizes the key information extracted from all of
> the above for commands that can be issued through the nutch script.
>
> It can be found at:
>
> http://mesavida.com/NutchSubCmds.html
>
> Do you all think that this kind of nutch subcommand reference is
> useful? If so, where should I put it?
>
> Thanks,
> Rob
>
> --
> Rob Pettengill
>
> Questions about petroleum?
> Goto: http://AskAboutOil.com/
