[ 
https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593807#action_12593807
 ] 

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

Here are the symptoms and possible remedies...

1. The same TIP FAILED on a previously 'lost' tasktracker.
2. The same TIP FAILED on the same machine, but the tasktracker had a 
different 'port', i.e. it failed on x.y.z:30342 and x.y.z:34223.

So, a couple of thoughts:
1. We might have to rework the logic to work around task FAILURES; currently 
the JT only schedules around nodes where the task FAILED. However, a lost 
tasktracker leads to tasks being marked KILLED, so the JT does not schedule 
around that node (#1).
2. We also have to track hostnames rather than 'trackernames'; the trackername 
includes the host:port, so the same machine on a different port is not 
recognized as the same node (#2).
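
To make both thoughts concrete, here is a rough sketch, not actual JobTracker 
code. The class and method names (HostFailureTracker, hostOf, recordBadAttempt, 
shouldAvoid) are made up for illustration, and it assumes the trackername is a 
plain host:port string as in the example above. An attempt that was KILLED 
because its tracker was lost is recorded the same way as a FAILED one.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: remembers, per TIP, the hosts
// on which an attempt went bad, keyed by hostname instead of trackername.
final class HostFailureTracker {
    // TIP id -> hostnames where an attempt of that TIP FAILED or was lost
    private final Map<String, Set<String>> badHostsByTip = new HashMap<>();

    // Strip the port from a tracker name. ASSUMPTION: the name is "host:port".
    static String hostOf(String trackerName) {
        int colon = trackerName.lastIndexOf(':');
        return colon < 0 ? trackerName : trackerName.substring(0, colon);
    }

    // Call for attempts that FAILED, and also for attempts that were
    // KILLED because their tasktracker was lost.
    void recordBadAttempt(String tipId, String trackerName) {
        badHostsByTip
            .computeIfAbsent(tipId, k -> new HashSet<>())
            .add(hostOf(trackerName));
    }

    // True if this TIP already had a bad attempt on the tracker's host,
    // regardless of which port the tracker is currently using.
    boolean shouldAvoid(String tipId, String trackerName) {
        Set<String> hosts = badHostsByTip.get(tipId);
        return hosts != null && hosts.contains(hostOf(trackerName));
    }
}
```

With this, a TIP that went bad on x.y.z:30342 would also be kept away from 
x.y.z:34223, while other TIPs and other hosts are unaffected.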

Thoughts?

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the 
> current job risks failing as well, because reduce tasks failing on marginal 
> TaskTrackers are assigned repeatedly to the same TaskTrackers (probably 
> because they have the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or 
> TaskTrackers need to get some better smarts to find failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this 
> case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
