Hi Hadoopers,

Many of the computations I'm performing with MapReduce require chains of MapReduce operations, where the output of one or more previous reduce steps is the input to a later mapper. Since the JobConf object doesn't seem to let you specify a chain of jobs (or does it? I may just be misunderstanding the API), I've simply been waiting for JobClient.runJob on one step to return (which only happens when 100% of the reducers have finished) and then executing the next job. A rough sketch of this sequential approach is at the end of this message.

What I'm wondering is whether there is any way to make the system take advantage of the time that is currently wasted at the end of the first job's reduce phase, when most of the reducers have completed but before 100% have finished. This can be fairly significant: for one computation I've been working on lately, over 25% of the time is spent in the last 10% of each map/reduce operation (this has to do with the natural distribution of my input data and would be unavoidable even given an optimal partitioning). During this time, I have dozens of nodes sitting idle that could be executing the map part of the next job, if only the framework knew that it was coming.

Has anyone dealt with this or found a good workaround?
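For reference, here's roughly what my current sequential chaining looks like. This is a minimal sketch using the old org.apache.hadoop.mapred API; the class, job, and path names are just placeholders for my actual setup:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            // First pass: reads the raw input, writes intermediate output.
            JobConf first = new JobConf(ChainedJobs.class);
            first.setJobName("first-pass");
            FileInputFormat.setInputPaths(first, new Path("/data/input"));
            FileOutputFormat.setOutputPath(first, new Path("/data/intermediate"));
            // ... set mapper/reducer classes for the first pass ...

            // Blocks until 100% of the first job's reducers have finished,
            // even though most nodes go idle well before that point.
            JobClient.runJob(first);

            // Second pass: its mappers consume the first job's reduce output.
            JobConf second = new JobConf(ChainedJobs.class);
            second.setJobName("second-pass");
            FileInputFormat.setInputPaths(second, new Path("/data/intermediate"));
            FileOutputFormat.setOutputPath(second, new Path("/data/output"));
            // ... set mapper/reducer classes for the second pass ...

            JobClient.runJob(second);
        }
    }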
Thanks! Chris
