Speculative execution algorithm in 1.0 is too pessimistic in many cases
-----------------------------------------------------------------------

                 Key: MAPREDUCE-3895
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3895
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: jobtracker, performance
    Affects Versions: 1.0.0
            Reporter: Nathan Roberts


We are seeing many instances where largish jobs are ending up with 30-50% of 
reduce tasks being speculatively re-executed. This can be a significant drain 
on cluster resources. 

The primary cause is that progress in the reduce phase can make huge jumps in 
a very short amount of time. This leads the speculative execution code to 
conclude that many tasks have fallen far behind the average when in fact they 
have not.

The important piece of the algorithm is essentially:
* Am I more than 20% behind the average progress?
* Have I been running for at least a minute?
* Have any tasks completed yet?
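
To make the three checks concrete, here is a minimal sketch in Python. It is 
not the JobTracker code: the names TaskSnapshot, PROGRESS_GAP, 
MIN_RUNTIME_SECS and is_speculation_candidate are hypothetical. Only the 20% 
gap and the one-minute minimum come from the list above, and the third check 
is passed in as a flag because its exact scope is not spelled out here.

```python
from dataclasses import dataclass

# Thresholds taken from the list above; the constant names are hypothetical.
PROGRESS_GAP = 0.20      # "more than 20% behind the average progress"
MIN_RUNTIME_SECS = 60    # "running for at least a minute"


@dataclass
class TaskSnapshot:
    """Hypothetical view of one running task."""
    progress: float       # 0.0 .. 1.0
    runtime_secs: float   # time since the task started


def is_speculation_candidate(task, average_progress, completion_check_passed):
    """Return True when all three checks from the list hold for `task`.

    `completion_check_passed` stands in for the third bullet; the caller
    decides how to evaluate it (an assumption of this sketch).
    """
    far_behind = average_progress - task.progress > PROGRESS_GAP
    ran_long_enough = task.runtime_secs >= MIN_RUNTIME_SECS
    return far_behind and ran_long_enough and completion_check_passed
```

For example, with an average progress of 0.60, a task at 0.33 that has run 
for two minutes passes the first two checks, while a task at 0.50 does not.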

Unfortunately, a set of reduce tasks that spends a couple of minutes in the 
Copy phase and very little time in the Sort phase will trigger all of these 
conditions for a large percentage of the reduce tasks. Each task's progress 
jumps from 33% to 66% almost instantly, which pulls the average up and makes 
the tasks still in Copy look far behind, which then triggers the speculation. 
I've seen this on several very large jobs that spend about 2 minutes in Copy, 
a few seconds in Sort, and 40 minutes in Reduce. These jobs launch about 30-40% 
additional reduce tasks, which then run for almost the full 40 minutes. 
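
The effect can be reproduced with a small simulation. This is a sketch under 
stated assumptions, not a measurement: each phase (Copy, Sort, Reduce) is 
assumed to count for one third of a task's progress, all reducers are assumed 
to start together, the completion check is assumed to pass, and the task 
count and per-task copy times are made up. Only the 20% gap, the one-minute 
minimum and the rough phase lengths come from this report.

```python
# Values marked "from the report" come from the text above; all names and
# the per-task copy times are hypothetical.
PROGRESS_GAP = 0.20      # from the report: more than 20% behind the average
MIN_RUNTIME_SECS = 60    # from the report: running for at least a minute


def reduce_progress(elapsed, copy_secs, sort_secs, reduce_secs):
    """Progress in [0, 1], assuming each phase counts for one third."""
    progress = 0.0
    for phase_secs in (copy_secs, sort_secs, reduce_secs):
        fraction = min(max(elapsed / phase_secs, 0.0), 1.0)
        progress += fraction / 3.0
        elapsed -= phase_secs  # time left over for the next phase
    return progress


def flagged_tasks(now, copy_times, sort_secs=5, reduce_secs=2400):
    """Indices of tasks that pass the progress and runtime checks at `now`.

    All tasks are assumed to start at time 0, so `now` is also their runtime.
    """
    progress = [reduce_progress(now, c, sort_secs, reduce_secs)
                for c in copy_times]
    average = sum(progress) / len(progress)
    return [i for i, p in enumerate(progress)
            if average - p > PROGRESS_GAP and now >= MIN_RUNTIME_SECS]


# Seven reducers finish Copy after 100-112 s; three need 150-160 s.
copy_times = [100, 102, 104, 106, 108, 110, 112, 150, 155, 160]
before_jump = flagged_tasks(90, copy_times)    # every task is still in Copy
after_jump = flagged_tasks(125, copy_times)    # seven tasks are past Sort
print(len(before_jump), len(after_jump))
```

With these made-up numbers, no task is flagged at 90 seconds, but at 125 
seconds the three slower reducers are flagged, even though they trail the 
others by about a minute of Copy time in a job that runs for roughly 40 
minutes.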

This area becomes more pluggable in MRv2, but for 1.0 it would be good if some 
portion of this algorithm could be made configurable so that a job has some 
degree of control (just disabling speculative execution is not really an 
option). 
 



