[ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593807#action_12593807 ]
Arun C Murthy commented on HADOOP-3333:
---------------------------------------
Here are the symptoms and possible remedies...
1. The same TIP FAILED on a previously 'lost' tasktracker.
2. The same TIP FAILED on the same machine, but the tasktracker had a
different 'port', e.g. it failed on x.y.z:30342 and then on x.y.z:34223.
So, a couple of thoughts:
1. We might have to rework the logic that works around task FAILURES: currently
the JT only schedules around nodes where the task FAILED, but a lost
tasktracker leads to its tasks being marked KILLED, so those nodes are never
avoided.
2. We also have to track hostnames rather than 'trackernames', since a
trackername includes the host:port and so changes when the tracker comes back
on a different port (symptom #2 above). A rough sketch of both ideas follows.
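To make that concrete, here is a minimal, illustrative sketch of a per-TIP
blacklist keyed by hostname that counts attempts KILLED on a lost tracker the
same way as FAILED attempts. This is not the actual JobTracker/TaskInProgress
code; the class and method names are hypothetical.
{code}
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only -- not the real JobTracker/TaskInProgress code.
 * Tracks, per TIP, how often each *host* has failed the task, counting
 * attempts KILLED because their tasktracker was lost the same way as
 * regular FAILED attempts.
 */
public class TipHostBlacklist {

  /** Failure counts keyed by hostname, not by trackername (host:port). */
  private final Map<String, Integer> failuresPerHost =
      new HashMap<String, Integer>();

  /** Strip the port from a trackername of the form "host:port". */
  static String hostOf(String trackerName) {
    int colon = trackerName.indexOf(':');
    return colon < 0 ? trackerName : trackerName.substring(0, colon);
  }

  /** Record a FAILED attempt on the given tracker. */
  public void attemptFailed(String trackerName) {
    bump(hostOf(trackerName));
  }

  /**
   * Record an attempt KILLED because its tracker was declared lost;
   * treated like a failure for scheduling purposes (thought #1).
   */
  public void attemptKilledOnLostTracker(String trackerName) {
    bump(hostOf(trackerName));
  }

  /** Should this TIP be scheduled on the given tracker again? */
  public boolean shouldSchedule(String trackerName, int maxFailuresPerHost) {
    Integer n = failuresPerHost.get(hostOf(trackerName));
    return n == null || n < maxFailuresPerHost;
  }

  private void bump(String host) {
    Integer n = failuresPerHost.get(host);
    failuresPerHost.put(host, n == null ? 1 : n + 1);
  }
}
{code}
With the counts keyed by hostname, a tracker that comes back on a different
port (x.y.z:34223 instead of x.y.z:30342) no longer looks like a fresh node to
the TIP.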
Thoughts?
> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
> Key: HADOOP-3333
> URL: https://issues.apache.org/jira/browse/HADOOP-3333
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.3
> Reporter: Christian Kunz
> Assignee: Arun C Murthy
> Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed, and the
> current job risks failing as well, because reduce tasks that fail on marginal
> TaskTrackers are repeatedly reassigned to the same TaskTrackers (probably
> because those are the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or
> TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but it does not help in this
> case.
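For reference, a minimal sketch of how that limit is typically set on a job,
assuming the standard JobConf/Configuration API of that era (the value 12 is
simply the one quoted above):
{code}
import org.apache.hadoop.mapred.JobConf;

public class MaxAttemptsExample {
  public static void main(String[] args) {
    // Sketch only: raise the per-task attempt limit for reduces to 12, as in
    // the report above; JobConf inherits setInt(String, int) from Configuration.
    JobConf conf = new JobConf();
    conf.setInt("mapred.reduce.max.attempts", 12);
    System.out.println(conf.get("mapred.reduce.max.attempts"));
  }
}
{code}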