Rajesh Balamohan created TEZ-2738:
-------------------------------------
Summary: ContainerLauncher tries to connect to unhealthy node a
large number of times
Key: TEZ-2738
URL: https://issues.apache.org/jira/browse/TEZ-2738
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
Env: Ran a job with Tez (built from the master branch on Aug 24).
One of the nodes went down in the middle of the run, and DAGAppMaster had a
container launch pending on that node. After some time, the node was declared
unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster remained
unresponsive for more than 1.5 hours after DAG cleanup because it kept trying
to connect to the unhealthy node. I will attach the logs to this JIRA.