[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776358#action_12776358
 ] 

Jothi Padmanabhan commented on MAPREDUCE-1202:
----------------------------------------------

This looks puzzling. Could you give us a few more details:
# Number of maps/reducers in your job
# Were the other reducers able to fetch outputs from the map in question 
successfully?
# Is this reducer able to pull other map outputs successfully?

There are some built-in checks so that the framework does not kill maps too 
aggressively, but hundreds of failed fetches definitely suggests something is 
amiss.

> Checksum error on a single reducer does not trigger too many fetch failures 
> for mapper during shuffle
> -----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1202
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1202
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.20.1
>            Reporter: Qi Liu
>            Priority: Critical
>
> During one run of a large map-reduce job, a single reducer kept throwing a 
> ChecksumException when trying to shuffle from one mapper. The data on the 
> mapper node for that particular reducer is believed to be corrupted, since 
> there are disk issues on the mapper node. However, even after hundreds of 
> retries to fetch the shuffle data for that particular reducer, and numerous 
> reports to the JobTracker about this issue, the mapper is still not declared 
> failed with too many fetch failures by the JobTracker.
> Here is the log:
> 2009-11-10 19:55:05,655 INFO org.apache.hadoop.mapred.ReduceTask: 
> attempt_200911010621_0023_r_005396_0 Scheduled 1 outputs (0 slow hosts and 0 
> dup hosts)
> 2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: header: 
> attempt_200911010621_0023_m_039676_0, compressed len: 449177, decompressed 
> len: 776729
> 2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
> 776729 bytes (449177 raw bytes) into RAM from 
> attempt_200911010621_0023_m_039676_0
> 2009-11-10 19:55:38,737 INFO org.apache.hadoop.mapred.ReduceTask: Failed to 
> shuffle from attempt_200911010621_0023_m_039676_0
> org.apache.hadoop.fs.ChecksumException: Checksum Error
>       at 
> org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
>       at 
> org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
>       at 
> org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
>       at 
> org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
>       at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
> 2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: 
> attempt_200911010621_0023_r_005396_0 copy failed: 
> attempt_200911010621_0023_m_039676_0 from xx.yy.com
> 2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: 
> org.apache.hadoop.fs.ChecksumException: Checksum Error
>       at 
> org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
>       at 
> org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
>       at 
> org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
>       at 
> org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
>       at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
> 2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Task 
> attempt_200911010621_0023_r_005396_0: Failed fetch #113 from 
> attempt_200911010621_0023_m_039676_0
> 2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Failed to 
> fetch map-output from attempt_200911010621_0023_m_039676_0 even after 
> MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to 
> the JobTracker
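
The final log lines come from the reducer-side copier: after exhausting its 
per-map retry budget (or immediately on a read error such as this 
ChecksumException), the reducer notifies the JobTracker of the failed fetch. 
A minimal sketch of that decision, with hypothetical identifiers and an 
assumed retry budget:

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the reducer-side decision; identifiers are
// assumptions, not Hadoop's actual internals.
public class ShuffleRetryPolicy {
  private static final int MAX_FETCH_RETRIES_PER_MAP = 6; // assumed budget

  private final Map<String, Integer> attempts = new HashMap<>();

  /** Returns true if this failed fetch should be reported to the JobTracker. */
  public boolean shouldNotifyJobTracker(String mapAttemptId, boolean readError) {
    int n = attempts.merge(mapAttemptId, 1, Integer::sum);
    // Read errors (e.g. a checksum mismatch while shuffling into RAM) are
    // reported right away; other failures only after the budget is spent.
    return readError || n >= MAX_FETCH_RETRIES_PER_MAP;
  }
}
{code}

The log shows fetch failure #113 from the same map attempt, each apparently 
reported, which is why the absence of a "too many fetch failures" declaration 
on the JobTracker side looks like the bug.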

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
