James Kennedy wrote:
But back to my original question... Doug suggests that dependence on a driver process is acceptable. But has anyone needed true MapReduce chaining or tried it successfully? Or is it generally accepted that a multi-MapReduce algorithm should always be driven by a single process?
I would argue that this functionality is outside the scope of Hadoop. As far as I understand your question, what you need is orchestration: the ability to record the state of previously executed map-reduce jobs and to start the next map-reduce jobs based on that state, possibly a long time after the first job completes and from a different process.
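To make the contrast concrete, here is roughly what the single-process driver approach looks like with the classic org.apache.hadoop.mapred API (the class name, job names and path arguments below are just placeholders I made up; a real driver would also set its mapper/reducer classes, and details differ slightly between Hadoop versions):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStepDriver {
  public static void main(String[] args) throws IOException {
    Path intermediate = new Path(args[1]);

    // Step 1: reads the raw input, writes intermediate output.
    JobConf first = new JobConf(TwoStepDriver.class);
    first.setJobName("step-1");
    FileInputFormat.addInputPath(first, new Path(args[0]));
    FileOutputFormat.setOutputPath(first, intermediate);
    // ... setMapperClass()/setReducerClass() for the real work ...
    JobClient.runJob(first);          // blocks until step 1 finishes

    // Step 2: submitted only after step 1 has completed, because
    // both calls live in the control flow of this one driver process.
    JobConf second = new JobConf(TwoStepDriver.class);
    second.setJobName("step-2");
    FileInputFormat.addInputPath(second, intermediate);
    FileOutputFormat.setOutputPath(second, new Path(args[2]));
    JobClient.runJob(second);
  }
}

The knowledge of "which step we are at" exists only inside this process; kill it between the two runJob() calls and nothing on disk tells you where to resume, which is exactly the gap that external state (the marker files below) has to fill.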
I face this problem frequently, and so far I've been using a poor man's workflow system consisting of a bunch of cron jobs, shell scripts, and simple marker files that record the current state of the data. In a similar way you can implement advisory application-level locking, using lock files.
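As a minimal sketch of the lock-file part, assuming the locks live on the (D)FS next to the data and using the Hadoop FileSystem API (the "workflow.lock" name is arbitrary, not anything Hadoop or Nutch defines):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SegmentLock {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path segment = new Path(args[0]);
    Path lock = new Path(segment, "workflow.lock");

    // Advisory lock: createNewFile() returns false if the file
    // already exists, so a concurrent run simply backs off.
    if (!fs.createNewFile(lock)) {
      System.err.println("Another process is working on " + segment);
      return;
    }
    try {
      // ... run the map-reduce job(s) for this step ...
    } finally {
      fs.delete(lock, false);   // release the lock, even on failure
    }
  }
}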
Example: adding a new batch of pages to a Nutch index involves many steps, starting with fetchlist generation, fetching, parsing, updating the db, extraction of link information, and indexing. Each of these steps consists of one (or several) map-reduce jobs, and the input to each job depends on the output of the previous jobs. What you referred to in your previous email was a single-app driver for this workflow, called Crawl. But I'm using slightly modified versions of the individual tools, which on successful completion create marker files (e.g. fetching.done). Other tools check for the existence of these files, and either perform their function or exit (e.g. if I try to run updatedb on a segment that has been fetched but not yet parsed).
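A modified tool guarding itself with marker files might look roughly like the sketch below; the wrapper class and the parsing.done/updatedb.done names are just examples I'm inventing to follow the same convention as fetching.done, they're not something Nutch provides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical wrapper around updatedb: it refuses to run unless the
// segment has been fetched and parsed, and leaves its own marker
// behind for the next step in the chain to check.
public class GuardedUpdateDb {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path segment = new Path(args[0]);

    if (!fs.exists(new Path(segment, "fetching.done"))
        || !fs.exists(new Path(segment, "parsing.done"))) {
      System.err.println("Segment " + segment + " is not ready, exiting.");
      return;
    }

    // ... run the real updatedb map-reduce job on this segment ...

    fs.createNewFile(new Path(segment, "updatedb.done"));
  }
}

Because the state lives in the filesystem rather than in any one process, the steps can be run from cron, by hand, or from different machines, days apart, and still pick up where the pipeline left off.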
To summarize this long answer: I think this functionality belongs in the application layer built on top of Hadoop, and IMHO we are better off not implementing it in Hadoop proper.
-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
