[
https://issues.apache.org/jira/browse/MAPREDUCE-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248044#comment-16248044
]
Jason Lowe commented on MAPREDUCE-6633:
---------------------------------------
If this is very repeatable and seemingly tied only to one specific job while
others run fine, then it does not sound like the type of problem this JIRA
was addressing, i.e., better recovery handling for corrupted compressed files.
If it is a software bug, the stack trace should give a strong indication of
where the bug lies. If the stack trace shows the reducer is in the shuffle
phase, then it is likely an issue in the Hadoop framework, since the framework
handles almost everything in the shuffle phase. If the stack trace indicates
the reducer is in the reduce phase, then the balance tips more towards user
code in the reducer. In any case, this discussion is rapidly diverging from
the purpose of this JIRA and should be tracked in a separate JIRA if this
indeed looks like a bug in the Hadoop framework.
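
The triage rule above can be sketched as a small helper. This is only a rough
heuristic, not part of Hadoop: the class name StackTraceTriage, the Phase enum,
and the userPackagePrefix parameter are made up for illustration. The only
assumption taken from this issue is that frames under
org.apache.hadoop.mapreduce.task.reduce (Shuffle, Fetcher, InMemoryMapOutput)
belong to the framework's shuffle code, as in the trace quoted below.

```java
final class StackTraceTriage {
    /** Where the failure most likely occurred, judged from the trace alone. */
    public enum Phase { SHUFFLE, REDUCE, UNKNOWN }

    // Package of the framework's shuffle classes seen in the quoted trace.
    private static final String SHUFFLE_PACKAGE =
        "org.apache.hadoop.mapreduce.task.reduce.";

    /**
     * Scans the "at ..." frames from the top and returns the first match.
     * userPackagePrefix is hypothetical: the package of the job's own
     * reducer classes, e.g. "com.example.".
     */
    public static Phase classify(String stackTrace, String userPackagePrefix) {
        for (String line : stackTrace.split("\\R")) {
            String frame = line.trim();
            if (!frame.startsWith("at ")) {
                continue; // skip message lines and "Caused by:" headers
            }
            String method = frame.substring(3);
            if (method.startsWith(SHUFFLE_PACKAGE)) {
                return Phase.SHUFFLE; // framework code: suspect Hadoop
            }
            if (method.startsWith(userPackagePrefix)) {
                return Phase.REDUCE;  // user code: suspect the job's reducer
            }
        }
        return Phase.UNKNOWN;
    }

    public static void main(String[] args) {
        String trace =
            "org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle\n"
            + "\tat org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)\n"
            + "\tat org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)\n";
        System.out.println(classify(trace, "com.example."));
    }
}
```

A result of SHUFFLE points towards the framework and REDUCE towards the job's
own code; UNKNOWN means the trace needs to be read by hand.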
> AM should retry map attempts if the reduce task encounters compression
> related errors.
> ---------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6633
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6633
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Rushabh S Shah
> Assignee: Rushabh S Shah
> Fix For: 2.8.0, 2.7.3, 3.0.0-alpha1
>
> Attachments: MAPREDUCE-6633.patch
>
>
> When a reduce task encounters compression-related errors, the AM doesn't
> retry the corresponding map task.
> Here is the stack trace from one of the cases we encountered.
> {noformat}
> 2016-01-27 13:44:28,915 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#29
> at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:196)
> at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:104)
> at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
> at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.shuffle(InMemoryMapOutput.java:97)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:537)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> {noformat}
> In this case, the node on which the map task ran had a bad drive.
> If the AM had retried running that map task somewhere else, the job
> definitely would have succeeded.