This has come up a few times. There was an interesting post a while back on a prototype to chain map-reduce jobs together - which is what you are really looking for. See:
http://www.mail-archive.com/[email protected]/msg02773.html

I'm curious how mature this prototype is and whether there are any plans to integrate it into Hadoop. One of the controversies is whether, in the presence of failures, this makes performance worse rather than better (much like UDP vs. TCP - which is better depends on the error rate). The probability of a failure per job increases non-linearly as the number of nodes involved per job increases, so what makes sense for small clusters may not make sense for bigger ones. But it sure would be nice to have this option.

Joydeep

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Dyer
Sent: Tuesday, November 06, 2007 1:38 PM
To: [email protected]
Subject: performance of multiple map-reduce operations

Hi Hadoopers,

Many of the computations I am performing with MapReduce require chains of MapReduce operations, where the output of one or more previous reduce steps is the input to a later mapper. Since the JobConf object doesn't seem to let you specify a chain of jobs (or does it? I may just be misunderstanding the API), I've just been waiting for JobClient.runJob on one step to return (which only happens when 100% of the reducers are finished) and then executing the next job. But I'm wondering if there is any way to make the system take advantage of the time that is currently wasted at the end of the first job's reduce phase, when most of the reducers have completed but before 100% have finished. This can be fairly significant: for one computation I've been working on lately, over 25% of the time is spent in the last 10% of each map/reduce operation (this has to do with the natural distribution of my input data and would be unavoidable even with an optimal partitioning). During this time, I have dozens of nodes sitting idle that could be executing the map part of the next job, if only the framework knew it was coming. Has anyone dealt with this or found a good workaround?
Thanks!
Chris
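As a minimal, framework-free sketch of the overlap being asked for (all names here are hypothetical stand-ins, not Hadoop APIs): rather than putting a barrier between job 1's reduce and job 2's map, each reduce partition's output is handed to a next-stage map task as soon as that partition finishes, so stragglers in job 1 no longer idle the cluster.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical simulation of pipelining two chained "jobs": as each
// job-1 reduce partition completes, a job-2 map task starts on its
// output immediately instead of waiting for 100% of the reducers.
public class PipelinedChain {

    // Stand-in for a job-1 reducer working on partition p.
    static int reducePartition(int p) {
        return p * 10;
    }

    // Stand-in for a job-2 mapper consuming one reduce output.
    static int mapNextStage(int reduceOutput) {
        return reduceOutput + 1;
    }

    public static List<Integer> run(int partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Integer> reduces =
                new ExecutorCompletionService<>(pool);

        // Launch all job-1 reduce partitions.
        for (int p = 0; p < partitions; p++) {
            final int part = p;
            reduces.submit(() -> reducePartition(part));
        }

        // As each reduce finishes (in completion order, not submit
        // order), start the next job's map on its output right away.
        List<Future<Integer>> maps = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            final int out = reduces.take().get();
            maps.add(pool.submit(() -> mapNextStage(out)));
        }

        List<Integer> results = new ArrayList<>();
        for (Future<Integer> f : maps) {
            results.add(f.get());
        }
        pool.shutdown();
        Collections.sort(results); // sorted only for deterministic output
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(4)); // prints [1, 11, 21, 31]
    }
}
```

This only simulates the scheduling idea with threads in one JVM; in a real cluster the framework would need to know job 2 is coming so it can assign its map tasks to nodes freed by early-finishing reducers, which is what the prototype above aims at.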
