Ignored exceptions from MapOutputLocation.java:getFile lead to hung reduces
---------------------------------------------------------------------------

                 Key: HADOOP-1246
                 URL: https://issues.apache.org/jira/browse/HADOOP-1246
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.12.3
            Reporter: Arun C Murthy


Exceptions raised while fetching map outputs in 
MapOutputLocation.java:getFile (e.g. the content-length not matching the data 
actually received) lead to hung reduces: the MapOutputCopier simply ignores 
them, puts the host in the penalty box, and retries forever.
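
To make the content-length case concrete, here is a minimal sketch of a copy 
loop that surfaces a short read as an exception instead of swallowing it. The 
class LengthCheckedCopy and its copy method are hypothetical names used only 
for illustration; this is not the existing getFile code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of Hadoop: copies a stream and verifies
// that the number of bytes received matches the declared length.
class LengthCheckedCopy {
    static long copy(InputStream in, OutputStream out, long expectedLength)
            throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        // Read until end of stream, counting every byte written out.
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        // A mismatch is reported to the caller rather than ignored.
        if (total != expectedLength) {
            throw new IOException("Incomplete map output: expected "
                    + expectedLength + " bytes, received " + total);
        }
        return total;
    }
}
```

The caller can then decide whether the failure counts against the host, the 
map, or the reduce itself.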

Possible steps:
a) Distinguish between failure to fetch output and lost maps. (related to 
HADOOP-1158)
b) Ensure the reduce doesn't keep fetching from 'lost maps'. (related to 
HADOOP-1183)
c) On detection of 'failure to fetch' we should probably use exponential 
back-offs (rather than the current back-offs, which all stay of the same 
order) for hosts in the 'penalty box'.
d) If fetches still fail after, say, 4 attempts (with exponential back-offs), 
we should declare the Reduce as 'failed'.
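
Steps (c) and (d) could be sketched roughly as below. The class 
FetchPenaltyBox, its method names, the 1-second base delay, and the -1 return 
value are all assumptions made for illustration; none of them come from the 
current MapOutputCopier code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-host failure counts with exponential back-off
// and a hard limit after which the caller should fail the reduce.
class FetchPenaltyBox {
    private final long baseDelayMs;
    private final int maxFailures;
    private final Map<String, Integer> failures = new HashMap<>();

    FetchPenaltyBox(long baseDelayMs, int maxFailures) {
        this.baseDelayMs = baseDelayMs;
        this.maxFailures = maxFailures;
    }

    // Records a failed fetch from the host. Returns the back-off delay in
    // milliseconds, or -1 once the host has failed maxFailures times.
    long recordFailure(String host) {
        int count = failures.merge(host, 1, Integer::sum);
        if (count >= maxFailures) {
            return -1;
        }
        // Delay doubles with each failure: base, 2*base, 4*base, ...
        return baseDelayMs << (count - 1);
    }

    // A successful fetch clears the host's failure history.
    void recordSuccess(String host) {
        failures.remove(host);
    }
}
```

With new FetchPenaltyBox(1000, 4), successive failures from one host would 
yield delays of 1000, 2000 and 4000 ms, and the fourth failure would return 
-1, at which point the reduce could be declared 'failed'.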

The same hang could also arise from conditions such as full disks on the 
reducer, where the map output cannot be saved to the local disk (say, for 
large map outputs).

Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
