[ https://issues.apache.org/jira/browse/TEZ-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan updated TEZ-2738:
----------------------------------
Attachment: logs.tar.gz
Attaching logs.
> ContainerLauncher tries to connect to unhealthy node for large number of times
> ------------------------------------------------------------------------------
>
> Key: TEZ-2738
> URL: https://issues.apache.org/jira/browse/TEZ-2738
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Attachments: logs.tar.gz
>
>
> Env: Ran a job with Tez (built from the master branch on Aug 24).
> One of the nodes went down in the middle of the run, and DAGAppMaster had a
> container launch on that node. After some time, the node was declared
> unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was
> unresponsive for more than 1.5 hours after DAG cleanup. It kept trying to
> connect to the unhealthy node. I will attach the logs to this JIRA.
> ipc.client.connect.max.retries has been set to 50 in core-site.xml:
> {noformat}
> <property>
>   <name>ipc.client.connect.max.retries</name>
>   <value>50</value>
>   <description>Defines the maximum number of retries for IPC
>   connections.</description>
> </property>
> {noformat}
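
To get a feel for how a retry count of 50 can add up, here is a rough
back-of-the-envelope sketch in Java. It is not Tez or Hadoop code: the
RetryBudget class and its worstCase method are hypothetical, and the 20 s
connect timeout and 1 s retry interval are assumed values that the report
does not state (I believe they match the Hadoop defaults for
ipc.client.connect.timeout and ipc.client.connect.retry.interval, but that
should be checked against the cluster configuration). Only the retry count
of 50 comes from the core-site.xml above.

```java
import java.time.Duration;

/** Hypothetical helper: rough bound on how long one IPC connection setup can block. */
class RetryBudget {

    /**
     * Upper bound for a single connection setup, assuming every attempt
     * runs into the full connect timeout and is followed by one retry sleep.
     */
    static Duration worstCase(int maxRetries, Duration connectTimeout, Duration retryInterval) {
        return connectTimeout.plus(retryInterval).multipliedBy(maxRetries);
    }

    public static void main(String[] args) {
        int maxRetries = 50;                               // from the core-site.xml in the report
        Duration connectTimeout = Duration.ofSeconds(20);  // assumption, not stated in the report
        Duration retryInterval = Duration.ofSeconds(1);    // assumption, not stated in the report

        Duration bound = worstCase(maxRetries, connectTimeout, retryInterval);
        System.out.println("Worst case per connection setup: "
                + bound.toMinutes() + " min " + (bound.getSeconds() % 60) + " s");
    }
}
```

Under these assumptions a single connection setup could block for about
17.5 minutes, so 1.5 hours of retrying would correspond to several such
cycles. Whether that is what actually happened here is something the
attached logs would have to confirm.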