I'm pretty new to Nutch, but in reading through the mailing lists and other papers, I don't think I've seen any discussion of using NDFS to automate the end-to-end workflow for data that is going to be searched (fetch->index->merge->search).
The few crawler designs I'm familiar with typically have spiders (fetchers) and indexers on the same box. Once pages are crawled and indexed, the indexes are pipelined to merge/query boxes to complete the workflow.

When I look at the Nutch design and NDFS, I'm assuming the design intent for a 'pure NDFS' workflow is for the webdb to generate segments on an NDFS partition, and once the webdb update is complete, the segments are processed 'on-disk' by the subsequent fetch/index/merge/query mechanisms. Is this a correct assumption?

Automating this kind of continuous workflow usually depends on some kind of control mechanism that ensures the correct sequence of operations is performed. Are there any recommendations on the best way to automate this workflow when using NDFS? I've prototyped a continuous workflow system using a traditional pipeline model with per-stage work queues (a rough sketch of the idea is below), and I can see how that could be applied to a clustered filesystem like NDFS, but I'm curious to hear what design intent or best practice is envisioned for automating NDFS-based implementations.

Thanks, Jay
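P.S. For reference, here is a minimal sketch of the per-stage work-queue model I mean. It is not Nutch code; the stage names and the string "segment" items are just placeholders for whatever unit of work (e.g. a segment directory path on the shared filesystem) moves between steps. The point is only to show how one queue between each pair of stages enforces the fetch -> index -> merge ordering.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of a pipeline with per-stage work queues. Each stage is a
    // thread that takes a work item from its input queue, processes it,
    // and hands it to the next stage's queue.
    public class PipelineSketch {

        static class Stage extends Thread {
            private final String name;
            private final BlockingQueue<String> in;
            private final BlockingQueue<String> out;  // null for the final stage

            Stage(String name, BlockingQueue<String> in, BlockingQueue<String> out) {
                this.name = name;
                this.in = in;
                this.out = out;
            }

            // Placeholder for the real work (invoking the fetcher, indexer, etc.).
            protected void process(String segment) throws Exception {
                System.out.println(name + ": processing " + segment);
            }

            public void run() {
                try {
                    while (true) {
                        String segment = in.take();        // block until work arrives
                        process(segment);                  // fetch, index, merge, ...
                        if (out != null) out.put(segment); // hand off to the next stage
                    }
                } catch (InterruptedException e) {
                    // shutdown requested
                } catch (Exception e) {
                    e.printStackTrace();  // real code would retry or park the segment
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // One queue between each pair of stages enforces the ordering:
            // a segment is only indexed after it has been fetched, and only
            // merged after it has been indexed.
            BlockingQueue<String> toFetch = new LinkedBlockingQueue<String>();
            BlockingQueue<String> toIndex = new LinkedBlockingQueue<String>();
            BlockingQueue<String> toMerge = new LinkedBlockingQueue<String>();

            new Stage("fetch", toFetch, toIndex).start();
            new Stage("index", toIndex, toMerge).start();
            new Stage("merge", toMerge, null).start();

            // Whatever generates segments from the webdb feeds the head of
            // the pipeline; the item here is just an example path.
            toFetch.put("segments/20050101000000");
        }
    }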
