This has come up a few times. There was an interesting post a while back on a prototype to chain map-reduce jobs together - which is what you are really looking for. See:
http://www.mail-archive.com/[email protected]/msg02773.html

I'm curious how mature this prototype is and whether there are any plans to integrate it into Hadoop. One of the controversies is whether, in the presence of failures, this makes performance worse rather than better (much like UDP vs. TCP - which is better depends on the error rate). The probability of a failure per job increases non-linearly as the number of nodes involved per job increases, so what makes sense for small clusters may not make sense for bigger ones. But it sure would be nice to have this option.

Joydeep

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Dyer
Sent: Tuesday, November 06, 2007 1:38 PM
To: [email protected]
Subject: performance of multiple map-reduce operations

Hi Hadoopers,

Many of the computations I am performing with MapReduce require chains of MapReduce operations, where the output of one or more previous reduce steps is the input to a later mapper. Since the JobConf object doesn't seem to let you specify a chain of jobs (or does it? I may just be misunderstanding the API), I've just been waiting for JobClient.runJob on one step to return (which only happens when 100% of the reducers are finished) and then executing the next job. But I'm wondering if there is any way to make the system take advantage of the time that is currently wasted at the end of the first job's reduce phase, when most of the reducers have completed but before 100% have finished. This can be fairly significant: for one computation I've been working on lately, over 25% of the time is spent in the last 10% of each map/reduce operation (this has to do with the natural distribution of my input data and would be unavoidable even with an optimal partitioning). During this time, I have dozens of nodes sitting idle that could be executing the map part of the next job, if only the framework knew it was coming. Has anyone dealt with this or found a good workaround?
Thanks!
Chris
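As a minimal, framework-free sketch of the overlap being asked for (all names here are hypothetical stand-ins, not Hadoop APIs): rather than putting a barrier between job 1's reduce and job 2's map, each reduce partition's output is handed to a next-stage map task as soon as that partition finishes, so stragglers in job 1 no longer idle the cluster.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical simulation of pipelining two chained "jobs": as each
// job-1 reduce partition completes, a job-2 map task starts on its
// output immediately instead of waiting for 100% of the reducers.
public class PipelinedChain {

    // Stand-in for a job-1 reducer working on partition p.
    static int reducePartition(int p) {
        return p * 10;
    }

    // Stand-in for a job-2 mapper consuming one reduce output.
    static int mapNextStage(int reduceOutput) {
        return reduceOutput + 1;
    }

    public static List<Integer> run(int partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Integer> reduces =
                new ExecutorCompletionService<>(pool);

        // Launch all job-1 reduce partitions.
        for (int p = 0; p < partitions; p++) {
            final int part = p;
            reduces.submit(() -> reducePartition(part));
        }

        // As each reduce finishes (in completion order, not submit
        // order), start the next job's map on its output right away.
        List<Future<Integer>> maps = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            final int out = reduces.take().get();
            maps.add(pool.submit(() -> mapNextStage(out)));
        }

        List<Integer> results = new ArrayList<>();
        for (Future<Integer> f : maps) {
            results.add(f.get());
        }
        pool.shutdown();
        Collections.sort(results); // sorted only for deterministic output
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(4)); // prints [1, 11, 21, 31]
    }
}
```

This only simulates the scheduling idea with threads in one JVM; in a real cluster the framework would need to know job 2 is coming so it can assign its map tasks to nodes freed by early-finishing reducers, which is what the prototype above aims at.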
