Thanks for your clarification, Todd and Steve. I agree that simplicity and reliability often go hand in hand. Keeping state between jobs may be problematic, because when a job fails the whole job sequence has to be restarted. However, how about this scenario: each job in the sequence does exactly the same thing, but with different data/parameters, and how long the sequence runs depends on the status of the current job. Each job can write a checkpoint of its current status. That way, if a job fails, we can re-run the job sequence starting from the latest checkpoint instead of from the beginning. It would be nice if the whole sequence could be done in a single job; it is actually a loop, which comes up in many machine learning algorithms, where each iteration contains a Map and a Reduce step (PageRank calculation is one example). If an algorithm takes 1000 iterations to converge, the job-sequence approach means starting 1000 jobs, which is costly because of the start/stop overhead of each job. Do you think all of this is feasible with Hadoop?
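To make the loop concrete, here is a rough sketch of the driver pattern I have in mind, written against the old org.apache.hadoop.mapred API. The "ranks/iter-N" paths and the _DONE marker convention are assumptions for illustration only, and IdentityMapper/IdentityReducer stand in for real PageRank map/reduce classes:

// Sketch of an iterative MapReduce driver: each iteration is one Map+Reduce
// job whose output directory doubles as a checkpoint. The path layout and
// _DONE marker are illustrative conventions, not Hadoop features.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    JobConf base = new JobConf(IterativeDriver.class);
    FileSystem fs = FileSystem.get(base);

    // Resume from the latest checkpoint: skip past every iteration whose
    // _DONE marker the driver wrote on a previous (partial) run.
    // "ranks/iter-0" is assumed to hold the initial rank data.
    int i = 0;
    while (fs.exists(new Path("ranks/iter-" + (i + 1), "_DONE"))) {
      i++;
    }

    final int maxIterations = 1000;
    for (; i < maxIterations; i++) {
      Path in = new Path("ranks/iter-" + i);
      Path out = new Path("ranks/iter-" + (i + 1));
      fs.delete(out, true); // clear leftovers of a half-finished attempt

      JobConf conf = new JobConf(base);
      conf.setJobName("pagerank-iter-" + i);
      conf.setMapperClass(IdentityMapper.class);   // stand-in for PageRank map
      conf.setReducerClass(IdentityReducer.class); // stand-in for PageRank reduce
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);
      FileInputFormat.setInputPaths(conf, in);
      FileOutputFormat.setOutputPath(conf, out);

      JobClient.runJob(conf); // blocks; throws if the iteration fails

      fs.create(new Path(out, "_DONE")).close(); // checkpoint this iteration

      // A real driver would also test convergence here (e.g. via a
      // user-defined counter or a delta file) and break out early.
    }
  }
}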
I will take a look at PageRank on MapReduce. Thank you all again.

Thanks,
Jianmin

________________________________
From: Steve Loughran <[email protected]>
To: [email protected]
Sent: Monday, June 1, 2009 10:31:14 PM
Subject: Re: question about when shuffle/sort start working

Todd Lipcon wrote:
> Hi Jianmin,
>
> This is not (currently) supported by Hadoop (or Google's MapReduce either,
> afaik). What you're looking for sounds more like Microsoft's Dryad.
>
> One thing that is supported in versions of Hadoop after 0.19 is JVM reuse.
> If you enable this feature, task trackers will persist JVMs between jobs.
> You can then persist some state in static variables.
>
> I'd caution you, however, against making too much use of this fact as
> anything but an optimization. The reason that Hadoop is limited to MR (or
> M+RM* as you said) is that simplicity and reliability often go hand in
> hand. If you start maintaining important state in RAM on the tasktracker
> JVMs and one of them goes down, you may need to restart your entire job
> sequence from the top. In typical MapReduce, you may need to rerun a mapper
> or a reducer, but the state is all on disk, ready to go.
>
> -Todd

I'd thought the question is not necessarily one of maintaining state, but of chaining the output from one job into another, where the # of iterations depends on the outcome of the previous set. Funnily enough, this is what you (apparently) end up having to do when implementing PageRank-like ranking as MR jobs:

http://skillsmatter.com/podcast/cloud-grid/having-fun-with-pagerank-and-mapreduce
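For reference, the JVM-reuse knob Todd mentions is, as far as I know, the mapred.job.reuse.jvm.num.tasks property, settable via JobConf.setNumTasksToExecutePerJvm(). A minimal sketch, with the caveat that it reuses JVMs across tasks of a job rather than guaranteeing state survives across whole jobs:

// Sketch: enabling task-JVM reuse. Static state kept this way is best
// treated as a cache (e.g. a lazily loaded lookup table), never as
// required job state, since any JVM can still go down at any time.
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(JvmReuseExample.class);
    conf.setNumTasksToExecutePerJvm(-1); // -1 = unlimited reuse
    // equivalently: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
  }
}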
