Rushabh S Shah created MAPREDUCE-6633:
-----------------------------------------

             Summary: AM should retry map attempts if the reduce task 
encounters compression-related errors.
                 Key: MAPREDUCE-6633
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6633
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.7.2
            Reporter: Rushabh S Shah
            Assignee: Rushabh S Shah


When a reduce task encounters compression-related errors, the AM doesn't retry 
the corresponding map task.
Here is the stack trace from one case we encountered.
{noformat}
2016-01-27 13:44:28,915 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#29
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ArrayIndexOutOfBoundsException
        at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:196)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:104)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.shuffle(InMemoryMapOutput.java:97)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:537)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
{noformat}
In this case, the node on which the map task ran had a bad drive.
If the AM had rerun that map task on a different node, the job definitely 
would have succeeded.
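
To illustrate the requested behavior, here is a minimal, self-contained sketch. It is not the actual Hadoop code and not necessarily how this issue should be fixed. It assumes the root problem is that the decompressor throws an unchecked exception (ArrayIndexOutOfBoundsException in the trace above) that is not handled as a fetch failure, so it kills the reducer instead of being reported against the map attempt. The names ShuffleRetrySketch, MapOutputReader, copyMapOutput and reportedFailures are hypothetical stand-ins, not Hadoop APIs.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ShuffleRetrySketch {

    /** Hypothetical stand-in for the code that reads and decompresses one map output. */
    interface MapOutputReader {
        void read(String mapAttemptId) throws IOException;
    }

    /** Hypothetical stand-in for the fetch failures the reducer would report to the AM. */
    static final List<String> reportedFailures = new ArrayList<>();

    /**
     * Copies one map output. Returns true on success, or false if the copy
     * failed and the map attempt was recorded as a failed fetch.
     */
    static boolean copyMapOutput(String mapAttemptId, MapOutputReader reader) {
        try {
            reader.read(mapAttemptId);
            return true;
        } catch (IOException | RuntimeException e) {
            // Assumption: a decompression error means the map output is bad, so
            // treat it like any other fetch failure instead of failing the reducer.
            reportedFailures.add(mapAttemptId);
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulate a corrupt map output that makes the decompressor throw.
        boolean ok = copyMapOutput("attempt_m_000001_0", id -> {
            throw new ArrayIndexOutOfBoundsException("corrupt compressed block");
        });
        System.out.println("copied=" + ok + " reportedFailures=" + reportedFailures);
    }
}
```

In this sketch the reducer survives the bad map output and records the map attempt as a failed fetch. In a real job, the AM would then be expected to act on such reports by rerunning the map attempt, ideally on a different node.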



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
