[
https://issues.apache.org/jira/browse/HADOOP-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617393#action_12617393
]
Amareshwari Sriramadasu commented on HADOOP-3462:
-------------------------------------------------
There could be some problems with the suggested approach. For example, a faulty task could keep writing to scratch space until the disk runs out of space. Currently such TIPs fail after four attempts, thereby killing the job, which is the intended behavior. But marking them FAILED_INTERNAL would not kill the job; it would just blacklist all the TaskTrackers.
Similarly, if there are map tasks generating large map output files, or reduce tasks generating large merge files, the job should be killed instead of the map or reduce being retried on every TaskTracker.
To address this:
1. One solution could be to add a configuration property, _mapred.map/reduce.max.internal.failures_, and kill the TIP once the number of internal failures across its attempts exceeds _mapred.map/reduce.max.internal.failures_. We would then have to decide on a default value for this. But this approach could take more time to kill the job (see the first sketch below).
2. Another solution could be to limit the disk space available to a task (something along the lines of a per-process ulimit?) and fail the task if it exceeds the allotted space. But here it would be difficult to keep track of the disk space used by the task (see the second sketch below).
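For option 1, a minimal sketch of the counting logic, assuming a hypothetical per-TIP counter; the class name and the way the limit is passed in are made up for illustration and are not the actual TaskInProgress code:
{code}
// Hypothetical helper: counts FAILED_INTERNAL attempts for one TIP and
// reports when the configured limit (e.g. the proposed
// mapred.map.max.internal.failures) has been exceeded.
public class InternalFailureTracker {

  private final int maxInternalFailures; // value of the proposed property
  private int internalFailures = 0;

  public InternalFailureTracker(int maxInternalFailures) {
    this.maxInternalFailures = maxInternalFailures;
  }

  /** Record one FAILED_INTERNAL attempt for this TIP. */
  public void recordInternalFailure() {
    internalFailures++;
  }

  /** True once the TIP (and hence the job) should be failed outright. */
  public boolean shouldFailTip() {
    return internalFailures > maxInternalFailures;
  }
}
{code}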
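For option 2, a rough sketch of the space accounting that makes it difficult: something would have to walk the task's scratch directory repeatedly and compare against the allotted limit. The per-task byte limit assumed here does not correspond to any existing property and is only for illustration:
{code}
import java.io.File;

// Hypothetical checker: sums the bytes under a task's scratch directory and
// flags the task once it exceeds its allotted space. Doing this walk
// frequently for every running task is the bookkeeping cost mentioned above.
public class ScratchSpaceChecker {

  private final long allottedBytes;

  public ScratchSpaceChecker(long allottedBytes) {
    this.allottedBytes = allottedBytes;
  }

  /** Recursively sum the bytes used under the given directory. */
  private static long usedBytes(File dir) {
    long total = 0;
    File[] children = dir.listFiles();
    if (children == null) {
      return 0;
    }
    for (File f : children) {
      total += f.isDirectory() ? usedBytes(f) : f.length();
    }
    return total;
  }

  /** True if the task has exceeded its allotted scratch space. */
  public boolean exceedsLimit(File taskScratchDir) {
    return usedBytes(taskScratchDir) > allottedBytes;
  }
}
{code}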
Thoughts?
> reduce task failures during shuffling should not count against number of
> retry attempts
> ---------------------------------------------------------------------------------------
>
> Key: HADOOP-3462
> URL: https://issues.apache.org/jira/browse/HADOOP-3462
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.3
> Reporter: Christian Kunz
> Assignee: Amareshwari Sriramadasu
> Fix For: 0.19.0
>
>