[
https://issues.apache.org/jira/browse/MAPREDUCE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776358#action_12776358
]
Jothi Padmanabhan commented on MAPREDUCE-1202:
----------------------------------------------
This looks puzzling. Could you give us a few more details:
# Number of maps/reducers in your job
# Were the other reducers able to fetch outputs from the map in question
successfully?
# Is this reducer able to pull other map outputs successfully?
There are some built-in checks so that the framework does not kill maps
aggressively, but failing hundreds of times suggests something is definitely
amiss.
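For what it's worth, the kind of built-in check in question can be sketched as
follows. This is an illustrative sketch only, not the actual JobTracker code:
the class name, method names, and the threshold of 3 distinct reducers are all
assumptions. The point is that if failure reports are counted per *distinct*
reducer, one reducer retrying hundreds of times never crosses the threshold:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FetchFailureTracker {
    // Assumed threshold: a map is declared failed only after this many
    // *distinct* reducers report fetch failures against it (illustrative
    // value, not the real configuration).
    static final int MAX_FETCH_FAILURE_NOTIFICATIONS = 3;

    // map attempt id -> set of reducer attempt ids that reported a failure
    private final Map<String, Set<String>> reportersPerMap = new HashMap<>();

    /**
     * Record one fetch-failure report and return true if the map attempt
     * should now be declared "too many fetch failures".
     */
    public boolean reportFetchFailure(String mapAttemptId,
                                      String reducerAttemptId) {
        Set<String> reporters =
            reportersPerMap.computeIfAbsent(mapAttemptId, k -> new HashSet<>());
        reporters.add(reducerAttemptId);
        // Repeated reports from the same reducer do not grow the set, so a
        // single reducer retrying hundreds of times never trips the check --
        // the behaviour described in this issue.
        return reporters.size() >= MAX_FETCH_FAILURE_NOTIFICATIONS;
    }

    public static void main(String[] args) {
        FetchFailureTracker tracker = new FetchFailureTracker();
        String map = "attempt_200911010621_0023_m_039676_0";
        boolean failed = false;
        // 100 reports from the same reducer: still only one distinct reporter.
        for (int i = 0; i < 100; i++) {
            failed = tracker.reportFetchFailure(map, "r_005396_0");
        }
        System.out.println(failed); // prints "false"
    }
}
```

Under this accounting, checksum failures confined to a single reducer would
need some other escalation path (e.g. treating repeated local read errors
specially) before the map gets re-run.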
> Checksum error on a single reducer does not trigger too many fetch failures
> for mapper during shuffle
> -----------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-1202
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1202
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 0.20.1
> Reporter: Qi Liu
> Priority: Critical
>
> During one run of a large map-reduce job, a single reducer kept throwing a
> ChecksumException when trying to shuffle from one mapper. The data on the
> mapper node for that particular reducer is believed to be corrupted, since
> there were disk issues on the mapper node. However, even after hundreds of
> retries to fetch the shuffle data for that particular reducer, and numerous
> failure reports to the job tracker due to this issue, the mapper was still
> not declared as having too many fetch failures by the job tracker.
> Here is the log:
> 2009-11-10 19:55:05,655 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_200911010621_0023_r_005396_0 Scheduled 1 outputs (0 slow hosts and 0
> dup hosts)
> 2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: header:
> attempt_200911010621_0023_m_039676_0, compressed len: 449177, decompressed
> len: 776729
> 2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling
> 776729 bytes (449177 raw bytes) into RAM from
> attempt_200911010621_0023_m_039676_0
> 2009-11-10 19:55:38,737 INFO org.apache.hadoop.mapred.ReduceTask: Failed to
> shuffle from attempt_200911010621_0023_m_039676_0
> org.apache.hadoop.fs.ChecksumException: Checksum Error
> at
> org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
> at
> org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
> at
> org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
> at
> org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
> at
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
> 2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask:
> attempt_200911010621_0023_r_005396_0 copy failed:
> attempt_200911010621_0023_m_039676_0 from xx.yy.com
> 2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask:
> org.apache.hadoop.fs.ChecksumException: Checksum Error
> at
> org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
> at
> org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
> at
> org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
> at
> org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
> at
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
> at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
> 2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Task
> attempt_200911010621_0023_r_005396_0: Failed fetch #113 from
> attempt_200911010621_0023_m_039676_0
> 2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Failed to
> fetch map-output from attempt_200911010621_0023_m_039676_0 even after
> MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to
> the JobTracker
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.