Hi Rob,

That's a useful Nutch command reference you have there.  Could you
please put it on Nutch's Wiki?  http://wiki.apache.org/nutch/

Maybe somewhere in, or linked from, the
http://wiki.apache.org/nutch/CommandLineOptions page.

Otis


--- Rob Pettengill <[EMAIL PROTECTED]> wrote:

> I am crawling files for a vertical search engine - a case which has  
> similarities both to the intranet search with CrawlTool and the  
> internet search workflow in the tutorial.
> 
> I have two sets of questions that may be related:
> 
> 1. In CrawlTool after all the segments are fetched, there is a  
> comment which reads:
> 
>          // Re-fetch everything to get the complete set of incoming anchor texts
>          // associated with each page.  We should fix this, so that we can update
>          // the previously fetched segments with the anchors that are now in the
>          // database, but until that algorithm is written, we re-fetch.
> 
> Then there are calls to FetchListTool and Fetcher, which create a new
> segment with all of the documents and refetch them, before indexing
> the new combined segment.
> 
> In the internet indexing workflow described in the tutorial there
> are no equivalent steps to the refetching steps in the CrawlTool.
> 
> Can someone explain what is going on here?
> 
> ====
> 
> 2. I ended up with a half dozen segments, which I combined into one
> new segment using the SegmentMergeTool.  Inspired by the workflow
> differences described above, as an experiment I built a new page
> and link db via a workflow like this:
> nutch admin db2 create
> nutch updatedb db2 combined segment.
> 
> I then compared the original db with db2 using WebDBReader with the
> -stats option.  What I found was that db2 only had about half as many
> links as the original db.
> 
> Where did the links go?  Doesn't the combined segment have all of the
> original content?
> 
> Do these results correctly imply that a refetch is necessary any time
> segments are combined?  If so, why?
> 
> ====
> P.S.
> 
> As I've been sorting out Nutch, I found myself constantly going back
> and forth between the Java docs, the Java source code, the nutch
> script code, and the tutorial.  I ended up putting together a
> document that summarized the key information extracted from all of
> the above for commands that can be issued through the nutch script.
> 
> It can be found at:
> 
> http://mesavida.com/NutchSubCmds.html
> 
> Do you all think that this kind of nutch subcommand reference is  
> useful?  If so where should I put it?
> 
> Thanks,
> Rob
> 
> 
> --
> Rob Pettengill,
> 
> Questions about petroleum?
>      Goto:   http://AskAboutOil.com/
> 
> 
> 
> _______________________________________________
> Nutch-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-general
> 
