Thanks for your clarification, Todd and Steve. I agree that simplicity and reliability often go hand in hand. Keeping state between jobs may be problematic, because when a job fails the whole job sequence has to be restarted. However, how about this scenario: each job in the sequence does exactly the same thing, but with different data/parameters, and how long the sequence runs depends on the status of the current job. Each job can write a checkpoint of its current status. That way, if a job fails, we can re-run the job sequence starting from the latest checkpoint instead of from the beginning. It would be nice if the whole sequence could be done in a single job; it is actually a loop, which comes up in many machine learning algorithms, where each iteration contains a Map and a Reduce step (PageRank calculation is one example). If an algorithm takes 1000 iterations to converge, the job-sequence approach means starting 1000 jobs, which is costly because of the start/stop overhead of each job. Do you think all of this is feasible with Hadoop?
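To make the loop concrete, here is a rough sketch of the driver pattern I have in mind, written against the old org.apache.hadoop.mapred API. The "ranks/iter-N" paths and the _DONE marker convention are assumptions for illustration only, and IdentityMapper/IdentityReducer stand in for real PageRank map/reduce classes:

// Sketch of an iterative MapReduce driver: each iteration is one Map+Reduce
// job whose output directory doubles as a checkpoint. The path layout and
// _DONE marker are illustrative conventions, not Hadoop features.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    JobConf base = new JobConf(IterativeDriver.class);
    FileSystem fs = FileSystem.get(base);

    // Resume from the latest checkpoint: skip past every iteration whose
    // _DONE marker the driver wrote on a previous (partial) run.
    // "ranks/iter-0" is assumed to hold the initial rank data.
    int i = 0;
    while (fs.exists(new Path("ranks/iter-" + (i + 1), "_DONE"))) {
      i++;
    }

    final int maxIterations = 1000;
    for (; i < maxIterations; i++) {
      Path in = new Path("ranks/iter-" + i);
      Path out = new Path("ranks/iter-" + (i + 1));
      fs.delete(out, true); // clear leftovers of a half-finished attempt

      JobConf conf = new JobConf(base);
      conf.setJobName("pagerank-iter-" + i);
      conf.setMapperClass(IdentityMapper.class);   // stand-in for PageRank map
      conf.setReducerClass(IdentityReducer.class); // stand-in for PageRank reduce
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);
      FileInputFormat.setInputPaths(conf, in);
      FileOutputFormat.setOutputPath(conf, out);

      JobClient.runJob(conf); // blocks; throws if the iteration fails

      fs.create(new Path(out, "_DONE")).close(); // checkpoint this iteration

      // A real driver would also test convergence here (e.g. via a
      // user-defined counter or a delta file) and break out early.
    }
  }
}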
I will take a look at PageRank on MapReduce. Thank you all again.

Thanks,
Jianmin

________________________________
From: Steve Loughran <[email protected]>
To: [email protected]
Sent: Monday, June 1, 2009 10:31:14 PM
Subject: Re: question about when shuffle/sort start working

Todd Lipcon wrote:
> Hi Jianmin,
>
> This is not (currently) supported by Hadoop (or Google's MapReduce either,
> afaik). What you're looking for sounds more like Microsoft's Dryad.
>
> One thing that is supported in versions of Hadoop after 0.19 is JVM reuse.
> If you enable this feature, task trackers will persist JVMs between jobs.
> You can then persist some state in static variables.
>
> I'd caution you, however, against making too much use of this fact as
> anything but an optimization. The reason that Hadoop is limited to MR (or
> M+RM* as you said) is that simplicity and reliability often go hand in
> hand. If you start maintaining important state in RAM on the tasktracker
> JVMs and one of them goes down, you may need to restart your entire job
> sequence from the top. In typical MapReduce, you may need to rerun a mapper
> or a reducer, but the state is all on disk, ready to go.
>
> -Todd

I'd thought the question is not necessarily one of maintaining state, but of chaining the output from one job into another, where the # of iterations depends on the outcome of the previous set. Funnily enough, this is what you (apparently) end up having to do when implementing PageRank-like ranking as MR jobs:

http://skillsmatter.com/podcast/cloud-grid/having-fun-with-pagerank-and-mapreduce
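For reference, the JVM-reuse knob Todd mentions is, as far as I know, the mapred.job.reuse.jvm.num.tasks property, settable via JobConf.setNumTasksToExecutePerJvm(). A minimal sketch, with the caveat that it reuses JVMs across tasks of a job rather than guaranteeing state survives across whole jobs:

// Sketch: enabling task-JVM reuse. Static state kept this way is best
// treated as a cache (e.g. a lazily loaded lookup table), never as
// required job state, since any JVM can still go down at any time.
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(JvmReuseExample.class);
    conf.setNumTasksToExecutePerJvm(-1); // -1 = unlimited reuse
    // equivalently: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
  }
}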
