[jira] [Comment Edited] (TEZ-2738) ContainerLauncher tries to connect to unhealthy node for large number of times

Jeff Zhang (JIRA) Tue, 25 Aug 2015 00:07:15 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710740#comment-14710740
 ]


Jeff Zhang edited comment on TEZ-2738 at 8/25/15 7:06 AM:
----------------------------------------------------------

[~rajesh.balamohan] The large number of retries is caused by the YARN bug 
YARN-3238.

bq. After sometime, this node was declared as unhealthy.
I couldn't find any log entries showing that this node transitioned to UNHEALTHY.


was (Author: zjffdu):
[~rajesh.balamohan] The large number of retries is caused by the YARN bug 
YARN-3944.

bq. After sometime, this node was declared as unhealthy.
I couldn't find any log entries showing that this node transitioned to UNHEALTHY.

> ContainerLauncher tries to connect to unhealthy node for large number of times
> ------------------------------------------------------------------------------
>
>                 Key: TEZ-2738
>                 URL: https://issues.apache.org/jira/browse/TEZ-2738
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: logs.tar.gz
>
>
> Env: Ran a job with Tez (built from the master branch on Aug 24). 
> One of the nodes went down in the middle of the run, and DAGAppMaster had a 
> container launch on that node. After some time, this node was declared 
> unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was 
> unresponsive for more than 1.5 hours after DAG cleanup. It kept trying to 
> connect to the unhealthy node. I will attach the logs to this JIRA.
> ipc.client.connect.max.retries has been set to 50 in core-site.xml:
> {noformat}
>   <property>
>     <name>ipc.client.connect.max.retries</name>
>     <value>50</value>
>     <description>Defines the maximum number of retries for IPC connections.</description>
>   </property>
> {noformat}
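
As a rough illustration of how a retry limit like the one above can add up to a very long stall, the sketch below multiplies out nested retry loops. Only the value 50 comes from the core-site.xml snippet in this issue. The per-attempt wait, the number of outer retries, and the function name are hypothetical values chosen for the example; they are not taken from the attached logs or from the Hadoop source, and the model (one initial try plus N retries per outer attempt) is a simplifying assumption.

```python
def worst_case_wait_seconds(inner_retries, seconds_per_attempt, outer_attempts=1):
    """Upper bound on time spent when an inner retry loop is re-run by an outer layer.

    Simplified model (an assumption, not Hadoop's actual logic): each outer
    attempt performs one initial try plus `inner_retries` retries, and every
    try blocks for `seconds_per_attempt` before failing.
    """
    attempts_per_outer = inner_retries + 1
    return outer_attempts * attempts_per_outer * seconds_per_attempt


ipc_retries = 50        # ipc.client.connect.max.retries from core-site.xml above
assumed_wait = 20.0     # hypothetical seconds spent per failed connection attempt
assumed_outer = 6       # hypothetical number of retries by a layer above the IPC client

single_layer = worst_case_wait_seconds(ipc_retries, assumed_wait)
nested = worst_case_wait_seconds(ipc_retries, assumed_wait, assumed_outer)

print(f"single layer: {single_layer / 60:.1f} minutes")
print(f"nested layers: {nested / 3600:.1f} hours")
```

Under these assumed numbers, a single retry layer already blocks for many minutes, and any additional layer that retries on top of it multiplies that time, which is one way a 7-minute job could be followed by more than an hour of connection attempts.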



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
