Eric Payne created MAPREDUCE-6166:
-------------------------------------

             Summary: Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
                 Key: MAPREDUCE-6166
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6166
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 2.6.0
            Reporter: Eric Payne
            Assignee: Eric Payne


In very large MapReduce jobs (50,000 maps, 2,500 reducers), intermediate map 
partition output can get corrupted on disk on the map side. If the corrupted 
map output is too large to shuffle in memory, the reducer streams it straight 
to disk without validating the checksum. In jobs this large, hours can pass 
before the reducer finally tries to read the corrupted file and fails. Because 
each retry of the failed reduce attempt also takes hours, the cost of 
detecting the corruption this late is paid again on every retry.
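
To make the missing check concrete, here is a minimal sketch of validating a 
checksum while the bytes are being streamed to disk, so that a bad transfer 
fails at copy time rather than hours later at read time. It is not the Hadoop 
shuffle code and not necessarily the fix for this issue. The class name 
ShuffleChecksumSketch, the methods frame and copyAndVerify, and the wire layout 
(payload followed by an 8-byte CRC32 trailer) are all assumptions made for 
illustration; only the java.io and java.util.zip classes are real APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: the layout below is assumed, not Hadoop's on-disk format.
final class ShuffleChecksumSketch {

    // Hypothetical sender side: payload followed by its CRC32 as an 8-byte trailer.
    static byte[] frame(byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.write(payload);
        out.writeLong(crc.getValue());
        out.flush();
        return bytes.toByteArray();
    }

    // Hypothetical receiver side: copy payloadLength bytes to disk while
    // updating a running CRC32, then compare it with the trailer.
    static void copyAndVerify(InputStream in, OutputStream disk, long payloadLength)
            throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[64 * 1024];
        long remaining = payloadLength;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                throw new IOException("map output truncated, " + remaining + " bytes missing");
            }
            crc.update(buf, 0, n);   // checksum the same bytes that go to disk
            disk.write(buf, 0, n);
            remaining -= n;
        }
        long expected = new DataInputStream(in).readLong();
        if (expected != crc.getValue()) {
            throw new IOException("checksum mismatch: expected " + expected
                    + ", computed " + crc.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] framed = frame("partition bytes".getBytes(StandardCharsets.UTF_8));
        int payloadLength = framed.length - Long.BYTES;

        // A clean transfer passes silently.
        copyAndVerify(new ByteArrayInputStream(framed), new ByteArrayOutputStream(), payloadLength);

        // Flip one bit to simulate corruption on the map side.
        framed[3] ^= 0x01;
        try {
            copyAndVerify(new ByteArrayInputStream(framed), new ByteArrayOutputStream(), payloadLength);
            System.out.println("corruption was not detected");
        } catch (IOException e) {
            System.out.println("transfer rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is the placement of the check: the checksum is 
computed over the same buffer that is written to disk, so the copy costs one 
pass over the data and the failure surfaces while the fetch can still be 
retried from the map side.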



