[ https://issues.apache.org/jira/browse/HADOOP-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3462:
--------------------------------------------

    Attachment: patch-3462.txt


Shuffle failures caused by DiskOutOfSpaceException, even across four different 
task trackers, should not be reported as framework failures. For example, a 
sort benchmark run on 500 nodes with a single reducer will definitely fail 
because it runs out of space. Tasks that require that much disk space should 
not keep the job running indefinitely just because no space is left. With this 
in mind, here is a patch for review.

In the patch:
1. Errors/exceptions during the shuffle phase of a task, other than 
DiskOutOfSpaceException, mark the attempt *FAILED_FRAMEWORK*; a 
DiskOutOfSpaceException marks the attempt FAILED (see the sketch after this 
list).
2. FAILED_FRAMEWORK attempts of a TIP blacklist the tracker, but do not kill 
the job.
3. Adds a public API, isDiskOutOfSpaceException(Throwable th), to 
org.apache.hadoop.util.DiskChecker.
4. The jsp files are changed to show FAILED_FRAMEWORK attempts as part of the 
job's failures.
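
For reference, here is a minimal, self-contained sketch of the classification 
in (1) together with the helper from (3). It is not the patch itself: the 
nested exception class, the cause-chain walk, and classifyShuffleFailure() are 
illustrative assumptions.

{code:java}
import java.io.IOException;

/**
 * Sketch of the shuffle-failure classification described above. The
 * real patch adds isDiskOutOfSpaceException() to
 * org.apache.hadoop.util.DiskChecker; everything else here is a
 * stand-in for illustration only.
 */
public class ShuffleFailureSketch {

  /** Stand-in for DiskChecker.DiskOutOfSpaceException. */
  public static class DiskOutOfSpaceException extends IOException {
    public DiskOutOfSpaceException(String msg) {
      super(msg);
    }
  }

  /**
   * Returns true if the throwable, or anything in its cause chain,
   * is a DiskOutOfSpaceException.
   */
  public static boolean isDiskOutOfSpaceException(Throwable th) {
    while (th != null) {
      if (th instanceof DiskOutOfSpaceException) {
        return true;
      }
      th = th.getCause();
    }
    return false;
  }

  enum AttemptState { FAILED, FAILED_FRAMEWORK }

  /**
   * Hypothetical shuffle-phase handler: running out of disk space is
   * the task's own resource problem, so the attempt is FAILED; any
   * other shuffle error is treated as a framework problem.
   */
  static AttemptState classifyShuffleFailure(Throwable th) {
    return isDiskOutOfSpaceException(th)
        ? AttemptState.FAILED
        : AttemptState.FAILED_FRAMEWORK;
  }
}
{code}

With this split, only FAILED attempts count toward the job's retry limit, 
while FAILED_FRAMEWORK attempts blacklist the tracker as described in (2).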


> reduce task failures during shuffling should not count against number of 
> retry attempts
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3462
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3462
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.19.0
>
>         Attachments: patch-3462.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
