[
https://issues.apache.org/jira/browse/HADOOP-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Amareshwari Sriramadasu updated HADOOP-3462:
--------------------------------------------
Attachment: patch-3462.txt
Shuffle failures with DiskOutOfSpaceException on four different task trackers
should not report the failure as a framework failure. For example, a sort
benchmark on 500 nodes run with a single reducer will definitely fail because
it runs out of disk space. Tasks that require large amounts of disk space
should not keep the job running indefinitely because no space is left.
Considering this, here is a patch for review.
In the patch:
1. Errors/exceptions during the shuffle phase of a task, except
DiskOutOfSpaceException, mark the attempt *FAILED_FRAMEWORK*. A
DiskOutOfSpaceException marks the attempt FAILED.
2. The FAILED_FRAMEWORK attempts of a TIP blacklist the tracker, but do not
kill the job.
3. Adds a public API isDiskOutOfSpaceException(Throwable th) to
org.apache.hadoop.util.DiskChecker (a sketch follows below).
4. The jsp files are changed to show the FAILED_FRAMEWORK attempts as part of
the job failures.
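
To illustrate the new helper, here is a minimal sketch of what the API could
look like. Only the method name and signature come from the description above;
the cause-chain walk is an assumption of this sketch, and the actual
implementation in patch-3462.txt may simply test the throwable itself.

{code}
package org.apache.hadoop.util;

import java.io.IOException;

public class DiskChecker {

  // Nested exception that already exists in DiskChecker; thrown when a
  // local disk runs out of space.
  public static class DiskOutOfSpaceException extends IOException {
    public DiskOutOfSpaceException(String msg) {
      super(msg);
    }
  }

  // Sketch of the proposed public API: returns true if th, or any
  // throwable in its cause chain, is a DiskOutOfSpaceException.
  public static boolean isDiskOutOfSpaceException(Throwable th) {
    while (th != null) {
      if (th instanceof DiskOutOfSpaceException) {
        return true;
      }
      th = th.getCause();
    }
    return false;
  }
}
{code}

With such a check, the shuffle error path can decide whether a failed attempt
should be marked FAILED (counting against the retry limit) or FAILED_FRAMEWORK
(blacklisting the tracker without killing the job).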
> reduce task failures during shuffling should not count against number of
> retry attempts
> ---------------------------------------------------------------------------------------
>
> Key: HADOOP-3462
> URL: https://issues.apache.org/jira/browse/HADOOP-3462
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.3
> Reporter: Christian Kunz
> Assignee: Amareshwari Sriramadasu
> Fix For: 0.19.0
>
> Attachments: patch-3462.txt
>
>