Eric Payne created MAPREDUCE-6166:
-------------------------------------
Summary: Reducers do not catch bad map output transfers during
shuffle if data shuffled directly to disk
Key: MAPREDUCE-6166
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6166
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 2.6.0
Reporter: Eric Payne
Assignee: Eric Payne
In very large map/reduce jobs (50,000 maps, 2,500 reducers), the intermediate
map partition output can become corrupted on disk on the map side. If the
corrupted map output is too large to shuffle in memory, the reducer streams it
directly to disk without validating the checksum. In jobs this large, it can
take hours before the reducer finally tries to read the corrupted file and
fails. Because each retry of the failed reduce attempt can also take hours,
the cost of discovering the corruption this late is multiplied with every
retry.
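
To illustrate the kind of check that is missing, here is a minimal sketch of
validating a checksum while the bytes are being streamed to disk, so that a bad
transfer fails at fetch time rather than hours later. It is not the Hadoop
shuffle code and does not use the real map output format: the class name
ShuffleToDiskSketch, the method copyAndVerify, and the idea that the expected
CRC32 is already known to the caller are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper, not part of Hadoop.
final class ShuffleToDiskSketch {

    /**
     * Copies the fetched bytes to dest while computing a CRC32 over them.
     * Assumes the caller already knows the expected CRC32 of the payload.
     * Returns the number of bytes written, or throws if the checksum differs.
     */
    static long copyAndVerify(InputStream in, Path dest, long expectedCrc)
            throws IOException {
        // CheckedInputStream updates the checksum on every read.
        CheckedInputStream checked = new CheckedInputStream(in, new CRC32());
        long copied = 0;
        try (OutputStream out = Files.newOutputStream(dest)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = checked.read(buf)) != -1) {
                out.write(buf, 0, n);
                copied += n;
            }
        }
        long actualCrc = checked.getChecksum().getValue();
        if (actualCrc != expectedCrc) {
            // Fail at fetch time and do not leave the bad file behind.
            Files.deleteIfExists(dest);
            throw new IOException("Checksum mismatch: expected " + expectedCrc
                    + " but computed " + actualCrc);
        }
        return copied;
    }
}
```

With a check of this shape, a corrupted transfer surfaces as an IOException
during the fetch, at which point the caller could report the fetch failure and
retry, instead of the reducer failing much later when it reads the file.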