Ok guys, thank you for your responses.
Devaraj Das wrote:
I haven't confirmed this, but I vaguely remember that resource schedulers
(Torque, Condor) provide a feature that lets one submit a DAG of jobs. The
resource manager doesn't invoke a node in the DAG unless all nodes pointing to
it have finished successfully (or something like that), and the scheduler
framework does the bookkeeping to take care of failed jobs, etc.
In Hadoop there is an effort, "Integration of Hadoop with batch schedulers":
https://issues.apache.org/jira/browse/HADOOP-719
I am not sure whether it handles the use case where one could submit a chain
of jobs, but I think it potentially can.
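For what it's worth, the most basic "chain of jobs" can already be expressed by
a driver process that submits them back to back; here is a rough sketch using
the old org.apache.hadoop.mapred API (job names, paths, and the identity
mapper/reducer are just placeholders, not a real workflow):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Step 1: placeholder job using the identity mapper/reducer.
        JobConf job1 = new JobConf(ChainDriver.class);
        job1.setJobName("step-1");
        job1.setMapperClass(IdentityMapper.class);
        job1.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        JobClient.runJob(job1);    // blocks; throws an exception if the job fails

        // Step 2 runs only if step 1 succeeded, and consumes its output.
        JobConf job2 = new JobConf(ChainDriver.class);
        job2.setJobName("step-2");
        job2.setMapperClass(IdentityMapper.class);
        job2.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        JobClient.runJob(job2);
    }
}

The limitation, of course, is that the chain lives and dies with this single
driver process, which is exactly what the batch-scheduler integration would
help with.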
-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 24, 2007 6:10 AM
To: [email protected]
Subject: Re: Examples of chained MapReduce?
We are still evaluating Hadoop for use in our main-line analysis systems,
but we already have the problem of workflow scheduling.
Our solution for that was to implement a simpler version of Amazon's Simple
Queue Service. This allows us to have multiple redundant workers for some
tasks, or to throttle (choke down) the work done on others.
The basic idea is that queues contain XML tasks. Tasks are read from the
queue by workers, but are kept in a holding pen for a queue-specific time
period after they are read. If the task completes normally, the worker deletes
it; if the timeout expires before the worker completes the task, it is added
back to the queue.
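Very roughly, the read/hold/delete semantics look like the following minimal
Java sketch (class and method names are purely illustrative, not our actual
implementation):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class SimpleTaskQueue {
    private final Deque<String> queue = new ArrayDeque<String>();             // XML task payloads
    private final Map<String, Long> holdingPen = new HashMap<String, Long>(); // task -> deadline (ms)
    private final long visibilityTimeoutMs;                                   // queue-specific timeout

    public SimpleTaskQueue(long visibilityTimeoutMs) {
        this.visibilityTimeoutMs = visibilityTimeoutMs;
    }

    /** A worker reads a task; it sits in the holding pen until deleted or timed out. */
    public synchronized String read() {
        requeueExpired();
        String task = queue.pollFirst();
        if (task != null) {
            holdingPen.put(task, System.currentTimeMillis() + visibilityTimeoutMs);
        }
        return task;
    }

    /** Called by the worker when the task completes normally. */
    public synchronized void delete(String task) {
        holdingPen.remove(task);
    }

    public synchronized void add(String task) {
        queue.addLast(task);
    }

    /** Tasks whose timeout expired before completion go back onto the queue. */
    private void requeueExpired() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Long>> it = holdingPen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() < now) {
                queue.addLast(e.getKey());
                it.remove();
            }
        }
    }
}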
Workers are structured as a triple of scripts that are executed by a manager
process: a pre-condition that determines whether any work should be done at
all (usually a check for available local disk space or CPU cycles), an item
qualification (run against a particular item, in case the work is subject to
resource reservation), and a worker script.
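A manager loop over such a triple might look roughly like this (the script
names, the 10-minute timeout, and the queue class from the sketch above are
assumptions for illustration only):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WorkerManager {

    /** Runs a script with arguments and returns its exit code. */
    private static int run(String script, String... args)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<String>();
        cmd.add(script);
        for (String a : args) {
            cmd.add(a);
        }
        return new ProcessBuilder(cmd).start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        SimpleTaskQueue queue = new SimpleTaskQueue(10 * 60 * 1000L);
        while (true) {
            // 1. Pre-condition: should any work be done at all (disk space, CPU, ...)?
            if (run("./precondition.sh") != 0) {
                Thread.sleep(60 * 1000L);
                continue;
            }
            String item = queue.read();
            if (item == null) {
                Thread.sleep(60 * 1000L);
                continue;
            }
            // 2. Item qualification: is this particular item OK (e.g. resource reservation)?
            if (run("./qualify.sh", item) != 0) {
                continue; // leave it in the holding pen; it is requeued when its timeout expires
            }
            // 3. The worker itself; delete the task from the queue only on success.
            if (run("./worker.sh", item) == 0) {
                queue.delete(item);
            }
        }
    }
}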
Even this tiny little framework suffices for quite complex workflows and
work constraints. It would be very easy to schedule map-reduce tasks via a
similar mechanism.
On 6/23/07 5:34 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
James Kennedy wrote:
But back to my original question... Doug suggests that dependence on
a driver process is acceptable. But has anyone needed true MapReduce
chaining or tried it successfully? Or is it generally accepted that
a multi-MapReduce algorithm should always be driven by a single process?
I would argue that this functionality is outside the scope of Hadoop.
As far as I understand your question, you need orchestration, which involves
the ability to record the state of previously executed map-reduce jobs and to
start the next map-reduce jobs based on that state, possibly a long time after
the first job completes and from a different process.
I'm frequently facing this problem, and so far I've been using a poor-man's
workflow system consisting of a bunch of cron jobs, shell scripts, and simple
marker files to record the current state of the data. In a similar way you can
implement advisory application-level locking, using lock files.
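In its simplest form, such an advisory lock can look like this (the lock file
path and the "updatedb" wording are hypothetical; File.createNewFile() is
atomic, so only one cooperating process obtains the lock):

import java.io.File;
import java.io.IOException;

public class AdvisoryLock {
    public static void main(String[] args) throws IOException {
        File lock = new File("/tmp/updatedb.lock");   // hypothetical lock file location
        // createNewFile() is atomic: only one cooperating process sees 'true'.
        if (!lock.createNewFile()) {
            System.err.println("updatedb appears to be running already; exiting.");
            return;
        }
        try {
            // ... run the guarded job(s) here ...
        } finally {
            lock.delete();
        }
    }
}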
Example: adding a new batch of pages to a Nutch index involves many
steps, starting with fetchlist generation, fetching, parsing, updating
the db, extraction of link information, and indexing. Each of these
steps consists of one (or several) map-reduce jobs, and the input to
the next jobs depends on the output of previous jobs. What you
referred to in your previous email was a single-app driver for this
workflow, called Crawl. But I'm using slightly modified versions of the
individual tools, which create marker files (e.g. fetching.done) on successful
completion. The other tools check for the existence of these files and either
perform their function or exit (e.g. so that I don't run updatedb on a segment
that has been fetched but not yet parsed).
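A stripped-down sketch of what one of those checks looks like with the Hadoop
FileSystem API (the segment layout and marker names here are examples, not
Nutch's actual conventions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UpdateDbIfReady {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path segment = new Path(args[0]);                  // e.g. a crawl segment directory
        Path fetched = new Path(segment, "fetching.done");
        Path parsed  = new Path(segment, "parsing.done");  // hypothetical marker name
        Path done    = new Path(segment, "updatedb.done");

        // Run only if the earlier steps finished and this step hasn't run yet.
        if (!fs.exists(fetched) || !fs.exists(parsed) || fs.exists(done)) {
            System.err.println("Segment not ready (or already processed): " + segment);
            return;
        }

        // ... submit the updatedb map-reduce job(s) for this segment here ...

        fs.create(done).close();   // record state for the next tool in the chain
    }
}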
To summarize this long answer: I think that this functionality belongs in the
application layer built on top of Hadoop, and IMHO we are better off not
implementing it in Hadoop proper.