Reduce tasks fail too easily because of repeated fetch failures
---------------------------------------------------------------

                 Key: HADOOP-2220
                 URL: https://issues.apache.org/jira/browse/HADOOP-2220
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.16.0
            Reporter: Christian Kunz


Currently, a reduce task fails once it has more than MAX_FAILED_UNIQUE_FETCHES (hard-coded to 5) failures fetching output from distinct mappers (I believe this was introduced in HADOOP-1158).
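
For concreteness, a minimal sketch of the check as described above; the class and method names here are illustrative assumptions, not the actual ReduceTask shuffle code:

import java.util.HashSet;
import java.util.Set;

class FetchFailureTracker {
  // Hard-coded limit, as described in this issue.
  private static final int MAX_FAILED_UNIQUE_FETCHES = 5;

  // Map task ids from which a fetch has failed at least once.
  private final Set<String> failedMapIds = new HashSet<String>();

  /** Record a failed fetch; returns true if the reduce task should fail. */
  boolean recordFailure(String mapTaskId) {
    failedMapIds.add(mapTaskId);
    // Fail once failures have been seen from more than 5 distinct mappers,
    // regardless of how many mappers the job has in total.
    return failedMapIds.size() > MAX_FAILED_UNIQUE_FETCHES;
  }
}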

This causes problems for long-running jobs with a large number of mappers executing in multiple waves:
otherwise healthy reduce tasks fail merely because resource contention produces too many fetch failures, and their replacement attempts have to re-fetch all data from the already completed mappers, adding substantial IO overhead. Worse, the whole job fails once the same reducer exhausts its maximum number of attempts.

The limit should be a function of the number of mappers and/or the number of map waves, and the decision to fail should be more conservative (e.g. there is no need to fail a reducer when speculative execution is enabled and there are enough free slots to launch a speculative attempt). We might also consider not counting such a restart against the reducer's maximum number of attempts.
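
As an illustration of that proposal, a sketch in which the limit scales with the number of map tasks; the names and the 10% factor are assumptions, not a committed design:

class FetchFailureLimit {
  // Keep the current value as a floor so small jobs behave as before.
  private static final int MIN_FAILED_UNIQUE_FETCHES = 5;

  /** Allowed number of unique failed fetches, derived from the job size. */
  static int maxFailedUniqueFetches(int numMaps) {
    // Tolerate failures from up to 10% of the mappers, but never fewer
    // than the old hard-coded limit of 5.
    return Math.max(MIN_FAILED_UNIQUE_FETCHES, numMaps / 10);
  }
}

With something like this, a job with 10,000 maps would tolerate up to 1,000 unique fetch failures instead of 5, which better matches the transient contention seen with many map waves.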

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
