[ 
https://issues.apache.org/jira/browse/HADOOP-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617393#action_12617393
 ] 

Amareshwari Sriramadasu commented on HADOOP-3462:
-------------------------------------------------

There could be some problems with the suggested approach. For example, a 
faulty task could keep writing to scratch space until the disk runs out of 
space. Currently such a tip fails after four attempts, which kills the job; 
this is the intended behavior. Marking those attempts FAILED_INTERNAL instead 
would not kill the job and would just blacklist all the task trackers. 
Similarly, if map tasks generate large map output files or reduce tasks 
generate large merge files, the job should be killed instead of trying to run 
the map or reduce on every tasktracker.

To address this:
1. One solution could be a configuration property 
_mapred.map/reduce.max.internal.failures_. A tip would be killed once the 
number of internal failures of its attempts exceeds 
_mapred.map/reduce.max.internal.failures_. We would then have to decide on a 
default value for it. However, this approach could take longer to kill the 
job.
2. Another solution could be to limit the disk space available to a task 
(something along the lines of a process ulimit?) and fail the task if it 
exceeds the allotted space. Here, though, it would be difficult to keep track 
of the disk space used by the task.
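
As a rough illustration of option 1, the bookkeeping could look something like 
the sketch below. The class TipFailureTracker, its method names, and the 
Decision enum are hypothetical and are not part of the existing mapred code; 
maxInternalFailures stands in for the proposed 
_mapred.map/reduce.max.internal.failures_ value, and maxAttempts for the 
existing four-attempt limit.

```java
/**
 * Hypothetical sketch only: tracks ordinary and internal failures of one tip
 * separately, with an independent cap on each. Not an existing Hadoop class.
 */
class TipFailureTracker {

    /** What the caller should do after a failed attempt. */
    enum Decision { RETRY, KILL_TIP }

    private final int maxAttempts;          // existing limit (four by default)
    private final int maxInternalFailures;  // assumed value of the proposed property
    private int failedAttempts = 0;         // ordinary FAILED attempts
    private int internalFailures = 0;       // FAILED_INTERNAL attempts

    TipFailureTracker(int maxAttempts, int maxInternalFailures) {
        this.maxAttempts = maxAttempts;
        this.maxInternalFailures = maxInternalFailures;
    }

    /**
     * Records one failed attempt and decides whether the tip may be retried.
     *
     * @param internal true if the attempt ended as FAILED_INTERNAL
     */
    Decision recordFailure(boolean internal) {
        if (internal) {
            internalFailures++;
        } else {
            failedAttempts++;
        }
        // Ordinary failures kill the tip once the attempt limit is reached;
        // internal failures kill it only when they exceed their own cap.
        if (failedAttempts >= maxAttempts || internalFailures > maxInternalFailures) {
            return Decision.KILL_TIP;
        }
        return Decision.RETRY;
    }

    int getInternalFailures() {
        return internalFailures;
    }
}
```

With maxAttempts = 4 and a cap of 2, a tip would survive two internal failures 
and be killed on the third, which also shows why this approach could take 
longer to kill a job that is genuinely filling the disk.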

Thoughts?

> reduce task failures during shuffling should not count against number of 
> retry attempts
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3462
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3462
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.19.0
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
