Checksum error on a single reducer does not trigger "too many fetch failures" for 
the mapper during shuffle
-----------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1202
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1202
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobtracker
    Affects Versions: 0.20.1
            Reporter: Qi Liu
            Priority: Critical


During one run of a large map-reduce job, a single reducer kept throwing a 
ChecksumException when trying to shuffle from one mapper. The map output on that 
mapper node for this particular reducer is believed to be corrupted, since the 
mapper node has known disk issues. However, even after hundreds of retries to 
fetch the shuffle data for that reducer, and numerous resulting fetch-failure 
reports to the JobTracker, the mapper is still never declared failed with "too 
many fetch failures" by the JobTracker.

Here is the log:
2009-11-10 19:55:05,655 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_200911010621_0023_r_005396_0 Scheduled 1 outputs (0 slow hosts and0 dup 
hosts)
2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: header: 
attempt_200911010621_0023_m_039676_0, compressed len: 449177, decompressed len: 
776729
2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
776729 bytes (449177 raw bytes) into RAM from 
attempt_200911010621_0023_m_039676_0
2009-11-10 19:55:38,737 INFO org.apache.hadoop.mapred.ReduceTask: Failed to 
shuffle from attempt_200911010621_0023_m_039676_0
org.apache.hadoop.fs.ChecksumException: Checksum Error
        at 
org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
        at 
org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
        at 
org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
        at 
org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
        at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: 
attempt_200911010621_0023_r_005396_0 copy failed: 
attempt_200911010621_0023_m_039676_0 from xx.yy.com
2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: 
org.apache.hadoop.fs.ChecksumException: Checksum Error
        at 
org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
        at 
org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
        at 
org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
        at 
org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
        at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)

2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Task 
attempt_200911010621_0023_r_005396_0: Failed fetch #113 from 
attempt_200911010621_0023_m_039676_0
2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Failed to 
fetch map-output from attempt_200911010621_0023_m_039676_0 even after 
MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to the 
JobTracker
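The behavior above is consistent with a failure-declaration policy that only re-runs a map when *distinct* reducers complain about it. The following is a simplified, hypothetical sketch of such a policy, not the actual JobTracker source: the class name, thresholds (`MAX_FETCH_FAILURE_NOTIFICATIONS`, `MIN_FAILING_FRACTION`), and method signatures are all illustrative assumptions made to show why a single reducer retrying hundreds of times never trips the threshold.

```python
# Hypothetical model of a "too many fetch failures" policy. All names and
# thresholds here are illustrative assumptions, not Hadoop's real code.

class FetchFailureTracker:
    """Counts fetch-failure notifications per map attempt."""

    MAX_FETCH_FAILURE_NOTIFICATIONS = 3  # assumed: distinct complainers needed
    MIN_FAILING_FRACTION = 0.5           # assumed: fraction of running reducers

    def __init__(self, running_reducers):
        self.running_reducers = running_reducers
        self.failing_reducers = {}       # map attempt id -> set of reducer ids

    def report_fetch_failure(self, map_attempt, reducer_id):
        """Record one fetch-failure report; return True if the map attempt
        should now be declared lost and re-executed."""
        reducers = self.failing_reducers.setdefault(map_attempt, set())
        reducers.add(reducer_id)
        # Because reports are deduplicated per reducer, a single reducer
        # failing repeatedly (as in the log above, fetch #113 from the same
        # reducer) contributes only one entry and never meets the threshold.
        return (len(reducers) >= self.MAX_FETCH_FAILURE_NOTIFICATIONS
                and len(reducers) / self.running_reducers
                    >= self.MIN_FAILING_FRACTION)
```

Under this model, one reducer hitting a checksum error on corrupted map output can retry forever without the mapper ever being re-run, which matches the symptom reported in this issue.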


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.