[
https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594774#action_12594774
]
Amar Kamat commented on HADOOP-3333:
------------------------------------
Why don't we do something like:
1) If {{(num-unique-nodes-amongst-the-trackers /
total-trackers-registered-with-jt) > K}}, then blacklist the node for that TIP,
i.e. do as discussed above.
2) Else avoid blacklisting the host for that TIP (similar to the current
blacklisting of trackers).
where K = 0.25.
This will handle the corner case where the cluster is running on a small
number of nodes and the TIP has failed on at least one tracker on each node.
Also, the test can be kept as it is.
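A minimal sketch of that check in Java; the class and parameter names are
illustrative assumptions, not the actual JobInProgress/JobTracker code:
{code:java}
// Illustrative sketch of the proposed check; names here are assumptions,
// not the real JobTracker/JobInProgress internals.
public class TipBlacklistSketch {

  /** Threshold proposed above. */
  private static final double K = 0.25;

  /**
   * @param uniqueNodes   num-unique-nodes-amongst-the-trackers the TIP failed on
   * @param totalTrackers total-trackers-registered-with-jt
   * @return true if the node should be blacklisted for this TIP
   */
  public static boolean blacklistNodeForTip(int uniqueNodes, int totalTrackers) {
    if (totalTrackers <= 0) {
      return false; // no trackers registered yet, nothing to decide
    }
    // 1) ratio > K: blacklist the node for the TIP, as discussed above.
    // 2) else: avoid blacklisting the host for the TIP (similar to the
    //    current blacklisting of trackers).
    return ((double) uniqueNodes / totalTrackers) > K;
  }

  public static void main(String[] args) {
    // Small-cluster corner case: failures on 3 nodes out of 12 trackers
    // gives a ratio of exactly 0.25, so the node is not blacklisted.
    System.out.println(blacklistNodeForTip(3, 12));  // false
    // Failures spread over 8 nodes out of 12 trackers -> ratio ~0.67 -> blacklist.
    System.out.println(blacklistNodeForTip(8, 12));  // true
  }
}
{code}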
> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
> Key: HADOOP-3333
> URL: https://issues.apache.org/jira/browse/HADOOP-3333
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.3
> Reporter: Christian Kunz
> Assignee: Arun C Murthy
> Priority: Critical
> Fix For: 0.18.0
>
> Attachments: HADOOP-3333_0_20080503.patch,
> HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the
> current job risks failing as well, because reduce tasks that fail on marginal
> TaskTrackers are repeatedly reassigned to the same TaskTrackers (probably
> because they hold the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or
> TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this
> case.
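For reference, a minimal sketch of how that attempt limit is set with the old
mapred API; the value 12 matches the report, everything else is illustrative and
is not a fix for the reassignment problem itself:
{code:java}
import org.apache.hadoop.mapred.JobConf;

// Illustrative only: configures the per-reduce attempt limit mentioned above.
public class MaxAttemptsExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setMaxReduceAttempts(12);  // backed by mapred.reduce.max.attempts
    System.out.println(conf.get("mapred.reduce.max.attempts"));  // prints 12
  }
}
{code}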