[ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593793#action_12593793 ]

Christian Kunz commented on HADOOP-3333:
----------------------------------------

Arun,
The number of blacklisted TaskTrackers is low (less than 1%), because we have a 
high threshold (100 failures) for a TaskTracker to be declared blacklisted. In 
the past, with the default setting, we lost too many TaskTrackers too fast even 
when there were no hardware issues -- but this might have been fixed, and we 
might want to change this back to a more reasonable value. On the other hand, 
we did not have any problems using the high value until 0.16.3.
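
For reference, a minimal sketch of setting such a per-job threshold, assuming 
it maps to mapred.max.tracker.failures via the JobConf API 
(setMaxTaskFailuresPerTracker); the 100 mirrors the value mentioned above:

{code}
// Sketch only: raise the per-job limit of task failures on a single
// TaskTracker before that tracker is blacklisted for the job.
import org.apache.hadoop.mapred.JobConf;

public class BlacklistThresholdExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(BlacklistThresholdExample.class);
    // 100 failures before a TaskTracker is blacklisted for this job
    // (the default is much lower, around 4).
    conf.setMaxTaskFailuresPerTracker(100);
    // equivalent to: conf.setInt("mapred.max.tracker.failures", 100);
  }
}
{code}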

Amar,
By a 'marginal' TaskTracker I mean a TaskTracker running on a node with 
hardware failures that still runs most short tasks successfully, but has a 
higher chance of failing long-running tasks (e.g. reduce tasks shuffling the 
map outputs from many waves of short map tasks).
Concerning 'the same tasks being repeatedly assigned to the same TaskTracker', 
I can point you offline to a running job exhibiting the problem.
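
To make the 'at most twice on the same TaskTracker' proposal from the 
description below concrete, here is a rough, hypothetical sketch of the 
per-task bookkeeping the scheduler would need; TaskFailureHistory, 
recordFailure and mayAssign are made-up names, not existing JobTracker code:

{code}
// Hypothetical sketch: track how often each task has failed on each
// TaskTracker, and refuse to reassign once a per-tracker limit is reached.
import java.util.HashMap;
import java.util.Map;

class TaskFailureHistory {
  // taskId -> (trackerName -> number of failures on that tracker)
  private final Map<String, Map<String, Integer>> failures =
      new HashMap<String, Map<String, Integer>>();

  void recordFailure(String taskId, String tracker) {
    Map<String, Integer> perTracker = failures.get(taskId);
    if (perTracker == null) {
      perTracker = new HashMap<String, Integer>();
      failures.put(taskId, perTracker);
    }
    Integer count = perTracker.get(tracker);
    perTracker.put(tracker, count == null ? 1 : count + 1);
  }

  /** True if the task may still be scheduled on this tracker,
      i.e. it has failed there fewer than 'limit' times. */
  boolean mayAssign(String taskId, String tracker, int limit) {
    Map<String, Integer> perTracker = failures.get(taskId);
    if (perTracker == null) {
      return true;
    }
    Integer count = perTracker.get(tracker);
    return count == null || count < limit;
  }
}
{code}

With limit=2 this would implement the "at most twice" rule even when the 
marginal tracker is the only one with a free slot.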

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed, and 
> the current job risks failing as well, because reduce tasks failing on 
> marginal TaskTrackers are repeatedly assigned to the same TaskTrackers 
> (probably because those are the only available slots), eventually running 
> out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or 
> TaskTrackers need better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this 
> case.
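
As a side note on the last point, a minimal sketch of that attempt limit, 
assuming mapred.reduce.max.attempts is what JobConf.setMaxReduceAttempts 
controls; as the report notes, raising it is of little use while the retries 
keep landing on the same marginal tracker:

{code}
// Sketch only: the per-job reduce attempt limit mentioned in the description.
import org.apache.hadoop.mapred.JobConf;

public class ReduceAttemptsExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ReduceAttemptsExample.class);
    conf.setMaxReduceAttempts(12);  // equivalent to mapred.reduce.max.attempts=12
  }
}
{code}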

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
