[
https://issues.apache.org/jira/browse/TEZ-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710740#comment-14710740
]
Jeff Zhang edited comment on TEZ-2738 at 8/25/15 7:06 AM:
----------------------------------------------------------
[~rajesh.balamohan] The large number of retries is caused by the YARN bug
YARN-3238.
bq. After some time, this node was declared as unhealthy.
I didn't find any log entries showing that this node transitioned to UNHEALTHY.
was (Author: zjffdu):
[~rajesh.balamohan] The large number of retries is caused by the YARN bug
YARN-3944.
bq. After some time, this node was declared as unhealthy.
I didn't find any log entries showing that this node transitioned to UNHEALTHY.
> ContainerLauncher tries to connect to unhealthy node for large number of times
> ------------------------------------------------------------------------------
>
> Key: TEZ-2738
> URL: https://issues.apache.org/jira/browse/TEZ-2738
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Attachments: logs.tar.gz
>
>
> Env: Ran a job with Tez (built from the master branch on Aug 24).
> One of the nodes went down in the middle of the run, and DAGAppMaster had a
> container launch on that node. After some time, this node was declared as
> unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was
> unresponsive for more than 1.5 hours after DAG cleanup. It kept trying to
> connect to the unhealthy node. I will attach the logs to this JIRA.
> ipc.client.connect.max.retries has been set to 50 in core-site.xml:
> {noformat}
> <property>
>   <name>ipc.client.connect.max.retries</name>
>   <value>50</value>
>   <description>Defines the maximum number of retries for IPC connections.</description>
> </property>
> {noformat}
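For illustration only, here is a minimal sketch of how retries at two levels can multiply into a long wait. It assumes that an outer retry policy wraps the IPC-level connect loop. The class and method names are hypothetical, and the 20-second connect timeout and the outer attempt count are assumed values, not taken from the attached logs. Only the value 50 comes from the core-site.xml setting quoted above.

```java
import java.time.Duration;

/** Hypothetical helper; not part of Tez, YARN, or Hadoop. */
public class NestedRetryEstimate {

    /**
     * Upper bound on the time spent when an outer retry policy wraps an
     * inner connect loop. Simplification: every connect attempt is assumed
     * to block for the full timeout, and sleeps between retries are ignored.
     */
    public static Duration worstCaseWait(int outerAttempts,
                                         int innerAttempts,
                                         Duration connectTimeout) {
        if (outerAttempts < 0 || innerAttempts < 0) {
            throw new IllegalArgumentException("attempt counts must be non-negative");
        }
        // Each outer attempt runs the whole inner loop again, so the counts multiply.
        return connectTimeout.multipliedBy((long) outerAttempts * innerAttempts);
    }

    public static void main(String[] args) {
        int innerAttempts = 50;                           // ipc.client.connect.max.retries from the report
        Duration connectTimeout = Duration.ofSeconds(20); // assumed value
        int outerAttempts = 6;                            // assumed value

        Duration perOuterAttempt = worstCaseWait(1, innerAttempts, connectTimeout);
        Duration total = worstCaseWait(outerAttempts, innerAttempts, connectTimeout);

        System.out.println("Per outer attempt: " + perOuterAttempt.getSeconds() + " s");
        System.out.println("Total: " + total.toMinutes() + " min");
    }
}
```

Under these assumed numbers, a single outer attempt can already block for 1000 seconds, so a handful of outer attempts can exceed the 7-minute job runtime many times over. The actual timeouts and retry policy in effect would need to be confirmed from the cluster configuration and the logs.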
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)