I'm also thinking about implementing an automated workflow of fetchlist->crawl->updateDb->index. My project may not require NDFS, since it only involves deep crawling of 100,000 sites, but an appropriate workflow is still needed to automatically take care of failed URLs, newly added URLs, daily updates, etc. I'd appreciate it if somebody could share their experience with designing such a workflow.
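
To make the question concrete, here is a rough sketch of the kind of daily driver loop I have in mind. None of these classes or methods are real Nutch APIs; they are just placeholders for the generate/fetch/updatedb/index steps, with failed URLs fed back into the next cycle:

import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical daily crawl cycle: generate fetchlist -> fetch -> update db -> index.
 * Failed URLs are collected and fed back into the next day's fetchlist.
 */
public class DailyCrawlCycle {

    /** One pipeline step; takes a list of URLs, returns the URLs it produces. */
    interface Stage {
        List<String> run(List<String> urls) throws Exception;
    }

    private final Stage generate, fetch, updateDb, index;

    DailyCrawlCycle(Stage generate, Stage fetch, Stage updateDb, Stage index) {
        this.generate = generate;
        this.fetch = fetch;
        this.updateDb = updateDb;
        this.index = index;
    }

    /** Runs one daily cycle; seeds are newly added URLs plus yesterday's failures. */
    public List<String> runOnce(List<String> newUrls, List<String> previousFailures) throws Exception {
        List<String> seeds = new ArrayList<>(newUrls);
        seeds.addAll(previousFailures);

        List<String> fetchlist = generate.run(seeds);   // build today's fetchlist
        List<String> failed = fetch.run(fetchlist);     // crawl; collect failed URLs
        updateDb.run(fetchlist);                        // fold discovered links into the webdb
        index.run(fetchlist);                           // index the newly fetched segment
        return failed;                                  // retried on the next cycle
    }
}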

The nutch intranet crawler (or site-specific crawler, as I prefer to call it) is an automated process, but it's designed to conveniently deal with just a handful of sites. With a larger number of selected sites, I expect a modified version is needed. One modification I can think of is to create a lookup table in the urlfilter object, mapping each domain to be crawled to its corresponding regular expressions. The goal is to avoid entering 100,000 regexes in crawl-urlfilter.xml and checking ALL of them against every URL. Any comments?
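
As a rough illustration of the lookup-table idea (standalone Java, not an actual Nutch URLFilter plugin; the class and method names are mine):

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Host-keyed URL filter: instead of testing every URL against all 100,000
 * regexes, only the patterns registered for the URL's host are checked.
 */
public class HostKeyedUrlFilter {

    // host (e.g. "www.example.com") -> regexes that apply to that host
    private final Map<String, List<Pattern>> patternsByHost = new HashMap<>();

    public void addRule(String host, String regex) {
        List<Pattern> patterns = patternsByHost.get(host);
        if (patterns == null) {
            patterns = new ArrayList<>();
            patternsByHost.put(host, patterns);
        }
        patterns.add(Pattern.compile(regex));
    }

    /** Returns the URL unchanged if it passes, or null if it is filtered out. */
    public String filter(String url) {
        String host;
        try {
            host = new URL(url).getHost().toLowerCase();
        } catch (MalformedURLException e) {
            return null;                       // reject URLs we cannot parse
        }
        List<Pattern> patterns = patternsByHost.get(host);
        if (patterns == null) {
            return null;                       // host is not one of the selected sites
        }
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return url;                    // matched one of this host's rules
            }
        }
        return null;                           // no rule for this host matched
    }
}

The host lookup is a constant-time hash access, so each URL is only checked against the handful of patterns that belong to its own site rather than all 100,000.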

thanks,
-AJ


Jay Lorenzo wrote:

Thanks, that's good information - it sounds like I need to take a closer look at index deployment to see what the best solution is for automating index management.

The initial email was more about understanding what the envisioned workflow would be for automating the creation of those indexes in an NDFS-based system, that is, what choices are available for automating the fetchlist->crawl->updateDb->index part of the equation when you have one node hosting a webdb and a number of nodes crawling and indexing. If I used a message-based system, I assume I would create new fetchlists at given locations in NDFS and message the fetchers where to find them. Once a fetchlist has been crawled, I then need to update the webdb with the links discovered during the crawl.
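
For example, the coordination might look roughly like this, with a BlockingQueue standing in for whatever messaging system is used and fetchlist locations passed around as plain NDFS path strings (all names here are hypothetical, not Nutch or NDFS APIs):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of message-based coordination: a coordinator publishes the NDFS paths
 * of newly generated fetchlists, fetcher nodes consume them, and completion
 * messages trigger the webdb update.
 */
public class FetchCoordinatorSketch {

    private final BlockingQueue<String> fetchlistQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> completedQueue = new LinkedBlockingQueue<>();

    /** Coordinator side: announce where a new fetchlist lives in NDFS. */
    public void announceFetchlist(String ndfsPath) throws InterruptedException {
        fetchlistQueue.put(ndfsPath);
    }

    /** Fetcher node side: take the next fetchlist, crawl it, report completion. */
    public void runFetcher() throws InterruptedException {
        while (true) {
            String fetchlistPath = fetchlistQueue.take();
            crawl(fetchlistPath);                 // fetch the segment (placeholder)
            completedQueue.put(fetchlistPath);    // tell the coordinator we are done
        }
    }

    /** Coordinator side: once a segment is fetched, fold its links back into the webdb. */
    public void runDbUpdater() throws InterruptedException {
        while (true) {
            String segmentPath = completedQueue.take();
            updateWebDb(segmentPath);             // updatedb step (placeholder)
        }
    }

    private void crawl(String fetchlistPath) { /* invoke the fetcher here */ }

    private void updateWebDb(String segmentPath) { /* invoke the webdb update here */ }
}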

Maybe this is too complex a solution, but my sense is that map-reduce systems still need a way to manage the workflow and control flow required to build pipelines that generate indexes.

Thanks,

Jay Lorenzo

On 8/31/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
I assume that in most NDFS-based configurations the production search
system will not serve searches directly out of NDFS. Rather, indexes will be
created offline for a deployment (i.e., merging things to create one index per
search node), then copied out of NDFS to the local filesystem on a
production search node and placed in production. This can be done
incrementally, where new indexes are deployed without re-deploying old
indexes. In this scenario, new indexes are rotated in to replace old
indexes, and the .del file for each index is updated to reflect
deduping. There is no code yet which implements this.
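
(A very rough sketch of what the rotate-in step on a search node could look like, assuming the new index has already been copied out of NDFS into a local staging directory on the same filesystem as the live index area; as noted, no such code exists in Nutch yet, and the names below are illustrative only:)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/**
 * Illustration only: rotate a newly staged index directory into the live index
 * area on a search node without touching previously deployed indexes.
 */
public class IndexRotator {

    /** Moves a staged index directory into the live index area. */
    public static void rotateIn(Path stagedIndex, Path liveIndexDir) throws IOException {
        Files.createDirectories(liveIndexDir);
        Path target = liveIndexDir.resolve(stagedIndex.getFileName());
        // Atomic rename (same filesystem) so the searcher never sees a half-copied index.
        Files.move(stagedIndex, target, StandardCopyOption.ATOMIC_MOVE);
        // The searcher would then re-open its readers and refresh each index's
        // .del file to reflect deduping.
    }

    public static void main(String[] args) throws IOException {
        rotateIn(Paths.get(args[0]), Paths.get(args[1]));
    }
}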

Is this what you were asking?

Doug


Jay Lorenzo wrote:
I'm pretty new to nutch, but in reading through the mailing lists and other
papers, I don't think I've really seen any discussion on using NDFS with
respect to automating the end-to-end workflow for data that is going to be
searched (fetch->index->merge->search).

The few crawler designs I'm familiar with typically have spiders (fetchers)
and indexers on the same box. Once pages are crawled and indexed, the indexes
are pipelined to merge/query boxes to complete the workflow.

When I look at the nutch design and NDFS, I'm assuming the design intent
for a 'pure NDFS' workflow is for the webdb to generate segments on an NDFS
partition, and once the webdb update is completed, the segments are
processed 'on-disk' by the subsequent fetcher/index/merge/query mechanisms.
Is this a correct assumption?

Automating this kind of continuous workflow usually depends on some kind of
control mechanism to ensure that the correct sequence of operations is
performed.

Are there any recommendations on the best way to automate this
workflow when using NDFS? I've prototyped a continuous workflow system
using a traditional pipeline model with per-stage work queues, and I can see
how that could be applied to a clustered filesystem like NDFS, but I'm
curious to hear what design intent or best practice is envisioned
for automating NDFS-based implementations.
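
For reference, the prototype mentioned above is organized roughly like this, with each stage reading work items (e.g. segment paths) from its input queue and handing them to the next stage's queue; the stage names and items are illustrative only:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of the per-stage work queue model: each stage is a thread that takes
 * a work item from its input queue, processes it, and passes it downstream.
 */
public class PipelineSketch {

    /** One pipeline stage running in its own thread. */
    static class StageWorker extends Thread {
        private final String name;
        private final BlockingQueue<String> in;
        private final BlockingQueue<String> out;   // null for the final stage

        StageWorker(String name, BlockingQueue<String> in, BlockingQueue<String> out) {
            this.name = name;
            this.in = in;
            this.out = out;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    String item = in.take();                // wait for work
                    System.out.println(name + ": " + item); // real work would go here
                    if (out != null) {
                        out.put(item);                      // pass the item downstream
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();         // shut the stage down
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> toFetch = new LinkedBlockingQueue<>();
        BlockingQueue<String> toIndex = new LinkedBlockingQueue<>();
        BlockingQueue<String> toMerge = new LinkedBlockingQueue<>();

        new StageWorker("fetch", toFetch, toIndex).start();
        new StageWorker("index", toIndex, toMerge).start();
        new StageWorker("merge", toMerge, null).start();

        toFetch.put("segments/20050831");                   // enqueue a segment to process
    }
}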


Thanks,

Jay



--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---------------------------------------------------
