[ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635736#action_12635736 ]
Devaraj Das commented on HADOOP-4163:
-------------------------------------

Today, if the copier thread (ReduceTask.ReduceCopier.MapOutputCopier.run()) throws a Throwable, it is logged and ignored. I am wondering whether it makes sense to treat all exceptions except IOExceptions (which are mostly due to network issues) as fatal.

Here is one thought:
- Rename mergeThrowable to shuffleThrowable.
- In the copier thread, we could set shuffleThrowable when a Throwable is caught (IOException is already caught separately). In all the places where mergeThrowable is set, we could set shuffleThrowable.
- The loop inside fetchOutputs could check whether shuffleThrowable is non-null.
- When fetchOutputs returns with a 'false', we could check whether shuffleThrowable is an instance of Error and, if so, throw the Error out. Otherwise, we could wrap it in an IOException.

Doing it in the above way would mean that we call umbilical.fsError in exactly one place - Child.main(). But I am slightly apprehensive about the implications of making this change this late in the game. Thoughts?

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch, 4163_v2.patch
>
>
> I saw a reducer stuck at the shuffling stage, with the following exception logged in the log file:
>
> 2008-08-30 00:16:23,265 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>         at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
>         at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
>         at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
>         at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:185)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
> Caused by: java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:260)
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:197)
>         ... 11 more
>
> 2008-08-30 00:16:23,320 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: task_200808291851_0001_r_000023_0The reduce copier failed
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
>
> The task should have died.
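For illustration, here is a rough, simplified sketch of the control flow proposed in the comment above. The names shuffleThrowable, fetchOutputs, and MapOutputCopier come from the discussion, but the class layout, method signatures, and helpers (copyOutput, moreToFetch, runShuffle) are assumptions made for readability - this is not the actual ReduceTask code.

{code}
import java.io.IOException;

// Standalone sketch of the proposed shuffle failure handling; not real Hadoop code.
class ReduceCopierSketch {

  // Set by any copier thread that hits a non-IOException Throwable
  // (the field formerly called mergeThrowable, renamed per the proposal).
  private volatile Throwable shuffleThrowable;

  class MapOutputCopier extends Thread {
    public void run() {
      try {
        copyOutput();                      // hypothetical copy step
      } catch (IOException ioe) {
        // IOExceptions (mostly network issues) are already caught and
        // handled separately, e.g. by retrying the fetch.
      } catch (Throwable t) {
        shuffleThrowable = t;              // everything else is fatal to the shuffle
      }
    }

    private void copyOutput() throws IOException { /* ... */ }
  }

  // The fetch loop bails out as soon as a fatal throwable has been recorded.
  boolean fetchOutputs() {
    while (moreToFetch()) {
      if (shuffleThrowable != null) {
        return false;
      }
      // ... schedule copies, wait on copier threads ...
    }
    return true;
  }

  // Caller of fetchOutputs(): rethrow Errors as-is, wrap everything else in an
  // IOException so the failure propagates out of the task instead of being logged
  // and ignored.
  void runShuffle() throws IOException {
    if (!fetchOutputs()) {
      if (shuffleThrowable instanceof Error) {
        throw (Error) shuffleThrowable;
      }
      IOException ioe = new IOException("The reduce copier failed");
      ioe.initCause(shuffleThrowable);
      throw ioe;
    }
  }

  private boolean moreToFetch() { return false; }  // placeholder
}
{code}

The point of doing the instanceof Error check at this single exit is the one made in the comment above: the fatal error then surfaces through normal exception propagation, so umbilical.fsError only needs to be called in one place, Child.main().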
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.