Karthik Kambatla created MAPREDUCE-5877:
-------------------------------------------
Summary: Inconsistency between JT/TT for tasks taking a long time
to launch
Key: MAPREDUCE-5877
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5877
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: jobtracker, tasktracker
Affects Versions: 1.2.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical
When a task takes too long to launch (for genuine reasons, such as a large
distributed cache), the JT expires the task and the attempt's status changes to
"Error launching task". Depending on whether job recovery is enabled and on the
JT's restart state, a replacement attempt may or may not be launched, even when
the JT has not been restarted. Meanwhile, the TT is never informed of the
expiry and eventually launches the task anyway.
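
The mismatch can be illustrated with a toy model. This is only a sketch and not
Hadoop code: the Attempt class, jt_view, tt_view, and the timings are
hypothetical names and values chosen for illustration. It assumes the JT
expires an attempt that has not been reported as launched within a fixed
interval, while the TT keeps localizing and launches regardless.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    attempt_id: str
    assigned_at: float        # seconds: when the JT handed the attempt to the TT
    localization_secs: float  # time the TT needs before it can launch the task


def jt_view(attempt, now, launch_expiry_secs):
    """Toy JT: expires an attempt that was not launched before the deadline."""
    deadline = attempt.assigned_at + launch_expiry_secs
    launched_at = attempt.assigned_at + attempt.localization_secs
    if launched_at <= deadline:
        return "RUNNING" if now >= launched_at else "LAUNCHING"
    return "EXPIRED" if now >= deadline else "LAUNCHING"


def tt_view(attempt, now):
    """Toy TT: never hears about the expiry, so it launches once localized."""
    launched_at = attempt.assigned_at + attempt.localization_secs
    return "RUNNING" if now >= launched_at else "LOCALIZING"


if __name__ == "__main__":
    # Illustrative numbers only: 900 s of localization against a 600 s expiry.
    slow = Attempt("attempt_1", assigned_at=0.0, localization_secs=900.0)
    print("JT:", jt_view(slow, now=1000.0, launch_expiry_secs=600.0))
    print("TT:", tt_view(slow, now=1000.0))
```

With these numbers the two views disagree once localization finishes: the toy
JT has already given up on the attempt while the toy TT reports it as running.
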
As a workaround, one can increase mapred.tasktracker.expiry.interval, but
this also lengthens the time the JT takes to detect a failed TT.
We should have a per-job timeout for task launches/heartbeats, and the JT and
TT should report a consistent state for each attempt.
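
One possible shape for this, as a rough sketch only: the property name
example.job.task.launch.timeout.secs and the function names below are
hypothetical and do not exist in Hadoop 1.2.1. The idea is that each job may
override the launch timeout, and that every expiry the JT decides on is paired
with a kill action for the owning TT so both sides agree.

```python
# Hypothetical per-job property; no such key exists in Hadoop 1.2.1.
LAUNCH_TIMEOUT_KEY = "example.job.task.launch.timeout.secs"


def launch_timeout_secs(job_conf, cluster_default_secs):
    """Use the job's own launch timeout if set, else the cluster-wide default."""
    return float(job_conf.get(LAUNCH_TIMEOUT_KEY, cluster_default_secs))


def expire_launching(pending, now, cluster_default_secs):
    """Expire attempts whose launch deadline has passed.

    pending: list of dicts with 'attempt_id', 'tracker', 'assigned_at'
             and 'job_conf' (a plain dict standing in for the job config).
    Returns (expired_attempt_ids, kill_actions), where kill_actions maps each
    tracker to the attempts it should be told to kill on its next heartbeat.
    """
    expired = []
    kill_actions = {}
    for p in pending:
        timeout = launch_timeout_secs(p["job_conf"], cluster_default_secs)
        if now - p["assigned_at"] >= timeout:
            expired.append(p["attempt_id"])
            kill_actions.setdefault(p["tracker"], []).append(p["attempt_id"])
    return expired, kill_actions
```

In this sketch a job with a large distributed cache could raise only its own
launch timeout, leaving the cluster-wide TT failure detection interval
untouched.
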