[ https://issues.apache.org/jira/browse/MAPREDUCE-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219077#comment-15219077 ]
Eric Payne commented on MAPREDUCE-6633:
---------------------------------------

Thanks [~shahrs87] for reporting this issue and providing a patch. Overall, the patch looks good. I am a little nervous about re-fetching for _any_ exception: if there is a runtime exception in the reducer itself (a memory error, an NPE, etc.), maps would be re-run unnecessarily. I do understand that the risk of that is low, and in any case no data would be lost, just a little time and some wasted resources. What are your thoughts?

> AM should retry map attempts if the reduce task encounters compression-related errors.
> ---------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6633
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6633
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: MAPREDUCE-6633.patch
>
>
> When a reduce task encounters compression-related errors, the AM doesn't retry the corresponding map task.
> In one of the cases we encountered, here is the stack trace:
> {noformat}
> 2016-01-27 13:44:28,915 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#29
>         at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>         at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:196)
>         at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:104)
>         at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.shuffle(InMemoryMapOutput.java:97)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:537)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> {noformat}
> In this case, the node on which the map task ran had a bad drive.
> If the AM had retried that map task somewhere else, the job would definitely have succeeded.
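
To make the trade-off raised in the comment above concrete, here is a minimal sketch of one way to keep the re-fetch decision scoped to the consumption of a single map output, rather than triggering on any exception anywhere in the reducer. All names here ({{NarrowRetrySketch}}, {{MapOutput}}, {{Scheduler}}, {{copyFailed}}) are hypothetical placeholders, not the attached patch and not Hadoop's actual Fetcher/ShuffleScheduler API:

{code:java}
import java.io.IOException;

/**
 * Illustrative sketch only (hypothetical names): scope the "blame the map
 * output" decision to the code that consumes one fetched map output, so a
 * corrupt segment -- which can surface as a RuntimeException such as the
 * ArrayIndexOutOfBoundsException in the stack trace above -- triggers a
 * re-fetch of that one map, while exceptions thrown anywhere else in the
 * reducer still fail the reduce attempt as they do today.
 */
public class NarrowRetrySketch {

    /** Stand-in for the shuffle/decompress step of one map output. */
    interface MapOutput {
        void shuffle() throws IOException; // may also throw unchecked exceptions
    }

    /** Stand-in for the component that decides what gets re-run. */
    interface Scheduler {
        void copyFailed(String mapId);    // ask for a re-fetch / map re-run
        void copySucceeded(String mapId);
    }

    static void copyMapOutput(String mapId, MapOutput output, Scheduler scheduler) {
        try {
            output.shuffle();
        } catch (IOException | RuntimeException e) {
            // Anything thrown while consuming THIS map's bytes is treated as a
            // bad map output and reported, so the map can be re-run elsewhere.
            // The catch is deliberately scoped to this one call: an NPE in the
            // reduce function itself is outside it and propagates as before.
            scheduler.copyFailed(mapId);
            return;
        }
        scheduler.copySucceeded(mapId);
    }

    public static void main(String[] args) {
        Scheduler logOnly = new Scheduler() {
            public void copyFailed(String mapId)    { System.out.println("re-fetch " + mapId); }
            public void copySucceeded(String mapId) { System.out.println("ok " + mapId); }
        };
        // A corrupt segment surfacing as an unchecked decompressor error:
        copyMapOutput("map_000001", () -> { throw new ArrayIndexOutOfBoundsException(); }, logOnly);
        copyMapOutput("map_000002", () -> { /* clean copy */ }, logOnly);
    }
}
{code}

The point of the narrow scope is that even a broad catch is safe there: only failures attributable to one map's bytes are converted into a re-fetch, which addresses the "re-run maps on an unrelated reducer NPE" concern without losing the fix for decompression errors.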