Chris Dyer wrote:
For one computation I've been working on lately, over 25% of the time is
spent in the last 10% of each map/reduce operation (this has to do with the
natural distribution of my input data and would be unavoidable even given an
optimal partitioning).  During this time, I have dozens of nodes sitting
idle that could be executing the map part of the next job, if only the
framework knew that it was coming.  Has anyone dealt with this or found a
good workaround?

If your next job depends on the output of the prior job, then you need to wait for the prior to complete. But if your next job is independent, you can submit it right away, and its map tasks will run as the reduce tasks are running for the prior job.

Doug
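
To make the overlap concrete, here is a toy slot-scheduler simulation in plain Python. This is not Hadoop code; the scheduler, job shapes, and one-step task durations are invented purely to illustrate Doug's point that an independent job's map tasks can fill the slots that free up while the prior job's reduce tail is still running.

```python
def simulate(slots, jobs):
    """Simulate a fixed pool of task slots shared by independent jobs.

    jobs: list of (name, n_maps, n_reduces). A job's reduces may start
    only after all of its own maps finish, but an independent job's maps
    may start whenever a slot is free. Each task takes one time step.
    Returns a log of (time, job, phase) entries, one per scheduled task.
    """
    pending = {name: [m, r] for name, m, r in jobs}
    order = [name for name, _, _ in jobs]
    log = []
    t = 0
    while any(m or r for m, r in pending.values()):
        free = slots
        # Earlier-submitted jobs get first claim on free slots.
        for name in order:
            m, r = pending[name]
            if m:
                run = min(free, m)
                pending[name][0] -= run
                free -= run
                log += [(t, name, "map")] * run
            elif r:
                run = min(free, r)
                pending[name][1] -= run
                free -= run
                log += [(t, name, "reduce")] * run
            if free == 0:
                break
        t += 1
    return log

# Job A: 8 maps, 2 reduces; job B (independent): 6 maps, 3 reduces; 4 slots.
log = simulate(4, [("A", 8, 2), ("B", 6, 3)])
```

At time step 2, A's two reduces occupy only two of the four slots, so B's first two maps run in the same step; B does not wait for A to finish. If B instead depended on A's output, B could not be submitted until A completed and the idle slots would go unused.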