[
https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593825#action_12593825
]
Devaraj Das commented on HADOOP-3333:
-------------------------------------
bq. 1. We might have to rework the logic to work around task FAILURES;
currently the JT only schedules around nodes where the task FAILED. However, a
lost tasktracker leads to tasks being marked KILLED.
IMO we should leave this logic unchanged. If the re-execution on this lost TT
fails, that will make the JT schedule tasks around it.
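As a rough illustration of that distinction (this is not the actual JobTracker/TaskInProgress code; the class and method names below are made up), the idea is that only FAILED attempts blacklist a node for a given task, while KILLED attempts do not:
{code:java}
import java.util.HashSet;
import java.util.Set;

class TaskNodeAvoidance {
  enum AttemptState { RUNNING, SUCCEEDED, FAILED, KILLED }

  private final Set<String> failedHosts = new HashSet<String>();

  // Record the outcome of a task attempt that ran on 'host'.
  void attemptCompleted(String host, AttemptState state) {
    if (state == AttemptState.FAILED) {
      // A real failure on this machine: schedule around it from now on.
      failedHosts.add(host);
    }
    // KILLED attempts (e.g. from a lost tasktracker) are ignored on purpose:
    // the node itself is not suspect, so re-execution there stays allowed,
    // and the node is avoided only if that re-executed attempt also FAILS.
  }

  // May a fresh attempt of this task be scheduled on 'host'?
  boolean canRunOn(String host) {
    return !failedHosts.contains(host);
  }
}
{code}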
bq. 2. We also have to track hostnames rather than 'trackernames', trackername
includes the host:port... (#2)
This makes sense (as long as we don't depend on host:port, especially in the unit tests).
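A minimal sketch of what keying on hostnames instead of trackernames could look like, assuming trackernames follow a tracker_<host>:<port> pattern (that format and the helper name are assumptions here, not the actual API):
{code:java}
class TrackerNames {
  // Hypothetical helper: reduce a trackername to the bare hostname so that
  // per-task failure bookkeeping is keyed on the machine rather than on a
  // host:port pair that changes when the tasktracker restarts on a new port.
  static String hostFromTrackerName(String trackerName) {
    String s = trackerName.startsWith("tracker_")
        ? trackerName.substring("tracker_".length())
        : trackerName;
    int colon = s.indexOf(':');
    return colon < 0 ? s : s.substring(0, colon);
  }
}
{code}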
> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
> Key: HADOOP-3333
> URL: https://issues.apache.org/jira/browse/HADOOP-3333
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.3
> Reporter: Christian Kunz
> Assignee: Arun C Murthy
> Priority: Blocker
>
> We have a long-running job on its 2nd attempt. The previous run failed, and
> the current one risks failing as well, because reduce tasks that fail on
> marginal TaskTrackers are repeatedly assigned to the same TaskTrackers
> (probably because those are the only available slots), eventually running
> out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or
> TaskTrackers need better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but it does not help in
> this case.
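For reference, the attempts cap mentioned in the description is a per-job setting; a minimal sketch of raising it follows (the class name is illustrative, only the property name comes from the report). As noted above, a higher cap does not help when the attempts keep landing on the same marginal tracker:
{code:java}
import org.apache.hadoop.mapred.JobConf;

// Class name is illustrative only; the property name comes from the report.
public class ConfigureAttempts {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ConfigureAttempts.class);
    // Cap on attempts per reduce task before the job is failed.
    conf.setInt("mapred.reduce.max.attempts", 12);
    // conf.setMaxReduceAttempts(12);  // convenience setter, if present in this release
  }
}
{code}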