Ok guys, thank you for your responses.
Devaraj Das wrote:
I haven't confirmed this, but I vaguely remember that resource schedulers
(Torque, Condor) provide a feature that lets one submit a DAG of jobs. The
resource manager doesn't invoke a node in the DAG unless all nodes pointing to
it have finished successfully (or something like that), and the scheduler
framework does the bookkeeping to take care of failed jobs, etc.
In Hadoop there is an effort, "Integration of Hadoop with batch schedulers":
https://issues.apache.org/jira/browse/HADOOP-719
I am not sure whether it handles the use case where one could submit a chain
of jobs, but I think it potentially can.
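For what it's worth, the most basic "chain of jobs" can already be expressed by
a driver process that submits them back to back; here is a rough sketch using
the old org.apache.hadoop.mapred API (job names, paths, and the identity
mapper/reducer are just placeholders, not a real workflow):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Step 1: placeholder job using the identity mapper/reducer.
        JobConf job1 = new JobConf(ChainDriver.class);
        job1.setJobName("step-1");
        job1.setMapperClass(IdentityMapper.class);
        job1.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        JobClient.runJob(job1);    // blocks; throws an exception if the job fails

        // Step 2 runs only if step 1 succeeded, and consumes its output.
        JobConf job2 = new JobConf(ChainDriver.class);
        job2.setJobName("step-2");
        job2.setMapperClass(IdentityMapper.class);
        job2.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        JobClient.runJob(job2);
    }
}

The limitation, of course, is that the chain lives and dies with this single
driver process, which is exactly what the batch-scheduler integration would
help with.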
-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 24, 2007 6:10 AM
To: [email protected]
Subject: Re: Examples of chained MapReduce?
We are still evaluating Hadoop for use in our main-line analysis systems,
but we already have the problem of workflow scheduling.
Our solution for that was to implement a simpler version of Amazon's Simple
Queue Service. This allows us to have multiple redundant workers for some
tasks, or to throttle (choke down) the work done on others.
The basic idea is that queues contain XML tasks. Tasks are read from the
queue by workers, but are kept in a holding pen for a queue-specific time
period after they are read. If the task completes normally, the worker deletes
it; if the timeout expires before the worker completes the task, it is added
back to the queue.
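Very roughly, the read/hold/delete semantics look like the following minimal
Java sketch (class and method names are purely illustrative, not our actual
implementation):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class SimpleTaskQueue {
    private final Deque<String> queue = new ArrayDeque<String>();             // XML task payloads
    private final Map<String, Long> holdingPen = new HashMap<String, Long>(); // task -> deadline (ms)
    private final long visibilityTimeoutMs;                                   // queue-specific timeout

    public SimpleTaskQueue(long visibilityTimeoutMs) {
        this.visibilityTimeoutMs = visibilityTimeoutMs;
    }

    /** A worker reads a task; it sits in the holding pen until deleted or timed out. */
    public synchronized String read() {
        requeueExpired();
        String task = queue.pollFirst();
        if (task != null) {
            holdingPen.put(task, System.currentTimeMillis() + visibilityTimeoutMs);
        }
        return task;
    }

    /** Called by the worker when the task completes normally. */
    public synchronized void delete(String task) {
        holdingPen.remove(task);
    }

    public synchronized void add(String task) {
        queue.addLast(task);
    }

    /** Tasks whose timeout expired before completion go back onto the queue. */
    private void requeueExpired() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Long>> it = holdingPen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() < now) {
                queue.addLast(e.getKey());
                it.remove();
            }
        }
    }
}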
Workers are structured as a triple of scripts that are executed by a manager
process: a pre-condition that determines whether any work should be done at
all (usually a check for available local disk space or CPU cycles), an item
qualification (run against a particular item, in case the work is subject to
resource reservation), and a worker script.
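A manager loop over such a triple might look roughly like this (the script
names, the 10-minute timeout, and the queue class from the sketch above are
assumptions for illustration only):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WorkerManager {

    /** Runs a script with arguments and returns its exit code. */
    private static int run(String script, String... args)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<String>();
        cmd.add(script);
        for (String a : args) {
            cmd.add(a);
        }
        return new ProcessBuilder(cmd).start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        SimpleTaskQueue queue = new SimpleTaskQueue(10 * 60 * 1000L);
        while (true) {
            // 1. Pre-condition: should any work be done at all (disk space, CPU, ...)?
            if (run("./precondition.sh") != 0) {
                Thread.sleep(60 * 1000L);
                continue;
            }
            String item = queue.read();
            if (item == null) {
                Thread.sleep(60 * 1000L);
                continue;
            }
            // 2. Item qualification: is this particular item OK (e.g. resource reservation)?
            if (run("./qualify.sh", item) != 0) {
                continue; // leave it in the holding pen; it is requeued when its timeout expires
            }
            // 3. The worker itself; delete the task from the queue only on success.
            if (run("./worker.sh", item) == 0) {
                queue.delete(item);
            }
        }
    }
}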
Even this tiny little framework suffices for quite complex workflows and
work constraints. It would be very easy to schedule map-reduce tasks via a
similar mechanism.
On 6/23/07 5:34 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
James Kennedy wrote:
But back to my original question... Doug suggests that dependence on
a driver process is acceptable. But has anyone needed true MapReduce
chaining or tried it successfully? Or is it generally accepted that
a multi-MapReduce algorithm should always be driven by a single process?
I would argue that this functionality is outside the scope of Hadoop.
As far as I understand your question, you need orchestration, which involves
the ability to record the state of previously executed map-reduce jobs and to
start the next map-reduce jobs based on that state, possibly a long time after
the first job completes and from a different process.
I'm frequently facing this problem, and so far I've been using a poor-man's
workflow system consisting of a bunch of cron jobs, shell scripts, and simple
marker files to record the current state of the data. In a similar way you can
implement advisory application-level locking, using lock files.
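In its simplest form, such an advisory lock can look like this (the lock file
path and the "updatedb" wording are hypothetical; File.createNewFile() is
atomic, so only one cooperating process obtains the lock):

import java.io.File;
import java.io.IOException;

public class AdvisoryLock {
    public static void main(String[] args) throws IOException {
        File lock = new File("/tmp/updatedb.lock");   // hypothetical lock file location
        // createNewFile() is atomic: only one cooperating process sees 'true'.
        if (!lock.createNewFile()) {
            System.err.println("updatedb appears to be running already; exiting.");
            return;
        }
        try {
            // ... run the guarded job(s) here ...
        } finally {
            lock.delete();
        }
    }
}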
Example: adding a new batch of pages to a Nutch index involves many
steps, starting with fetchlist generation, fetching, parsing, updating
the db, extraction of link information, and indexing. Each of these
steps consists of one (or several) map-reduce jobs, and the input to
the next jobs depends on the output of previous jobs. What you
referred to in your previous email was a single-app driver for this
workflow, called Crawl. But I'm using slightly modified versions of the
individual tools, which create marker files (e.g. fetching.done) on successful
completion. The other tools check for the existence of these files and either
perform their function or exit (e.g. so that I don't run updatedb on a segment
that has been fetched but not yet parsed).
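A stripped-down sketch of what one of those checks looks like with the Hadoop
FileSystem API (the segment layout and marker names here are examples, not
Nutch's actual conventions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UpdateDbIfReady {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path segment = new Path(args[0]);                  // e.g. a crawl segment directory
        Path fetched = new Path(segment, "fetching.done");
        Path parsed  = new Path(segment, "parsing.done");  // hypothetical marker name
        Path done    = new Path(segment, "updatedb.done");

        // Run only if the earlier steps finished and this step hasn't run yet.
        if (!fs.exists(fetched) || !fs.exists(parsed) || fs.exists(done)) {
            System.err.println("Segment not ready (or already processed): " + segment);
            return;
        }

        // ... submit the updatedb map-reduce job(s) for this segment here ...

        fs.create(done).close();   // record state for the next tool in the chain
    }
}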
To summarize this long answer: I think that this functionality belongs in the
application layer built on top of Hadoop, and IMHO we are better off not
implementing it in Hadoop proper.