[ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635736#action_12635736 ]
Devaraj Das commented on HADOOP-4163:
-------------------------------------

Today, if the copier thread (ReduceTask.ReduceCopier.MapOutputCopier.run()) throws a Throwable, it is logged and ignored. I am wondering whether it makes sense to treat all exceptions except IOExceptions (which are mostly due to network issues) as fatal.

Here is one thought:
- Rename mergeThrowable to shuffleThrowable.
- In the copier thread, we could set shuffleThrowable when a Throwable is caught (IOException is already caught separately). In all the places where mergeThrowable is set, we could set shuffleThrowable.
- The loop inside fetchOutputs could check whether shuffleThrowable is non-null.
- When fetchOutputs returns with a 'false', we could check whether shuffleThrowable is an instance of Error and, if so, throw the Error out. Otherwise, we could wrap it in an IOException.

Doing it in the above way would mean that we call umbilical.fsError in exactly one place - Child.main(). But I am slightly apprehensive about the implications of making this change this late in the game. Thoughts?

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch, 4163_v2.patch
>
>
> I saw a reducer stuck at the shuffling stage, with the following exception logged in the log file:
>
> 2008-08-30 00:16:23,265 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>         at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
>         at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
>         at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
>         at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:185)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
> Caused by: java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:260)
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:197)
>         ... 11 more
>
> 2008-08-30 00:16:23,320 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: task_200808291851_0001_r_000023_0The reduce copier failed
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
>
> The task should have died.
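For illustration, here is a rough, simplified sketch of the control flow proposed in the comment above. The names shuffleThrowable, fetchOutputs, and MapOutputCopier come from the discussion, but the class layout, method signatures, and helpers (copyOutput, moreToFetch, runShuffle) are assumptions made for readability - this is not the actual ReduceTask code.

{code}
import java.io.IOException;

// Standalone sketch of the proposed shuffle failure handling; not real Hadoop code.
class ReduceCopierSketch {

  // Set by any copier thread that hits a non-IOException Throwable
  // (the field formerly called mergeThrowable, renamed per the proposal).
  private volatile Throwable shuffleThrowable;

  class MapOutputCopier extends Thread {
    public void run() {
      try {
        copyOutput();                      // hypothetical copy step
      } catch (IOException ioe) {
        // IOExceptions (mostly network issues) are already caught and
        // handled separately, e.g. by retrying the fetch.
      } catch (Throwable t) {
        shuffleThrowable = t;              // everything else is fatal to the shuffle
      }
    }

    private void copyOutput() throws IOException { /* ... */ }
  }

  // The fetch loop bails out as soon as a fatal throwable has been recorded.
  boolean fetchOutputs() {
    while (moreToFetch()) {
      if (shuffleThrowable != null) {
        return false;
      }
      // ... schedule copies, wait on copier threads ...
    }
    return true;
  }

  // Caller of fetchOutputs(): rethrow Errors as-is, wrap everything else in an
  // IOException so the failure propagates out of the task instead of being logged
  // and ignored.
  void runShuffle() throws IOException {
    if (!fetchOutputs()) {
      if (shuffleThrowable instanceof Error) {
        throw (Error) shuffleThrowable;
      }
      IOException ioe = new IOException("The reduce copier failed");
      ioe.initCause(shuffleThrowable);
      throw ioe;
    }
  }

  private boolean moreToFetch() { return false; }  // placeholder
}
{code}

The point of doing the instanceof Error check at this single exit is the one made in the comment above: the fatal error then surfaces through normal exception propagation, so umbilical.fsError only needs to be called in one place, Child.main().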
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.