Checksum error on a single reducer does not trigger too many fetch failures for
mapper during shuffle
-----------------------------------------------------------------------------------------------------
Key: MAPREDUCE-1202
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1202
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: jobtracker
Affects Versions: 0.20.1
Reporter: Qi Liu
Priority: Critical
During one run of a large map-reduce job, a single reducer kept throwing a
ChecksumException whenever it tried to shuffle the output of one particular
mapper. The map output for this reducer's partition on that mapper's node is
believed to be corrupted, since the node has known disk issues. However, even
after hundreds of retries to fetch that shuffle data, and the resulting stream
of reports to the JobTracker, the mapper was never declared to have too many
fetch failures by the JobTracker.
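For context, the reduce-side copier behaves roughly as follows. This is a
simplified, hypothetical sketch for illustration only, not the actual
ReduceTask source; MAX_FETCH_RETRIES_PER_MAP is the name that appears in the
log below, while the surrounding fields and methods are assumptions:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the reduce-side fetch/retry loop, for illustration.
    class MapOutputCopierSketch {
        // Retry limit named in the log below; the value here is illustrative.
        private static final int MAX_FETCH_RETRIES_PER_MAP = 6;

        // Consecutive fetch failures seen per map attempt (illustrative bookkeeping).
        private final Map<String, Integer> failuresPerMap = new HashMap<>();

        void copyOutput(String mapAttemptId) {
            try {
                shuffleInMemory(mapAttemptId);       // throws ChecksumException (an IOException)
                failuresPerMap.remove(mapAttemptId); // a successful copy resets the count
            } catch (IOException e) {
                int failures = failuresPerMap.merge(mapAttemptId, 1, Integer::sum);
                if (failures >= MAX_FETCH_RETRIES_PER_MAP) {
                    // "reporting to the JobTracker" in the log below.
                    reportFetchFailure(mapAttemptId);
                }
                // The fetch is simply re-scheduled, so a persistently corrupt map
                // output yields an open-ended retry series (#113 in the log below).
            }
        }

        private void shuffleInMemory(String mapAttemptId) throws IOException {
            // Placeholder for the real in-memory shuffle; corrupt map output
            // surfaces here as an org.apache.hadoop.fs.ChecksumException.
        }

        private void reportFetchFailure(String mapAttemptId) {
            // Placeholder for the TaskTracker/JobTracker notification path.
        }
    }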
Here is the log:
2009-11-10 19:55:05,655 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200911010621_0023_r_005396_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_200911010621_0023_m_039676_0, compressed len: 449177, decompressed len: 776729
2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 776729 bytes (449177 raw bytes) into RAM from attempt_200911010621_0023_m_039676_0
2009-11-10 19:55:38,737 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_200911010621_0023_m_039676_0
org.apache.hadoop.fs.ChecksumException: Checksum Error
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
    at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200911010621_0023_r_005396_0 copy failed: attempt_200911010621_0023_m_039676_0 from xx.yy.com
2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: org.apache.hadoop.fs.ChecksumException: Checksum Error
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
    at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200911010621_0023_r_005396_0: Failed fetch #113 from attempt_200911010621_0023_m_039676_0
2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_200911010621_0023_m_039676_0 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker
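One way to read this behavior: if the JobTracker only declares a map attempt
"too many fetch failures" after notifications from several distinct reducers,
then hundreds of reports from a single reducer never cross the threshold. The
sketch below illustrates that kind of accounting; the class, method, and
threshold value are assumptions for illustration, not the actual JobTracker
code:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of JobTracker-side fetch-failure accounting.
    class FetchFailureTrackerSketch {
        // Illustrative threshold; not a real configuration value.
        private static final int MAX_FAILED_UNIQUE_FETCHES = 4;

        // Map attempt -> distinct reduce attempts that reported a fetch failure.
        private final Map<String, Set<String>> reportersPerMap = new HashMap<>();

        // Returns true once the map attempt should be declared failed.
        boolean recordFetchFailure(String mapAttemptId, String reduceAttemptId) {
            Set<String> reporters =
                reportersPerMap.computeIfAbsent(mapAttemptId, k -> new HashSet<>());
            reporters.add(reduceAttemptId);
            // Counting distinct reporters means repeated reports from one reducer
            // (113 failed fetches in the log above) never advance the count past 1.
            return reporters.size() >= MAX_FAILED_UNIQUE_FETCHES;
        }
    }

Under this reading, a single corrupt map output keeps one reducer retrying
indefinitely while the map attempt is never re-executed, which matches the
behavior described above.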