[jira] [Updated] (TEZ-2738) ContainerLauncher tries to connect to unhealthy node for large number of times

Rajesh Balamohan (JIRA) Tue, 25 Aug 2015 00:28:30 -0700

     [ https://issues.apache.org/jira/browse/TEZ-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Rajesh Balamohan updated TEZ-2738:
----------------------------------
    Attachment: log.txt.gz

Thanks [~zjffdu]. It looks like the complete logs did not get uploaded, so I am 
attaching log.txt.gz again.

>>
2015-08-24 12:46:48,446 INFO [Dispatcher thread: Central] node.AMNodeImpl: AMNode cn051-10:47778 transitioned from ACTIVE to UNHEALTHY
>>

> ContainerLauncher tries to connect to unhealthy node for large number of times
> ------------------------------------------------------------------------------
>
>                 Key: TEZ-2738
>                 URL: https://issues.apache.org/jira/browse/TEZ-2738
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: log.txt.gz, logs.tar.gz
>
>
> Env: Ran a job with Tez (built from the master branch on Aug 24). 
> One of the nodes went down in the middle of the run, and DAGAppMaster had a 
> container launch on that node. After some time, the node was declared 
> unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was 
> unresponsive for more than 1.5 hours after DAG cleanup. It kept trying to 
> connect to the unhealthy node. I will attach the logs to this JIRA.
> ipc.client.connect.max.retries has been set to 50 in core-site.xml:
> {noformat}
>   <property>
>     <name>ipc.client.connect.max.retries</name>
>     <value>50</value>
>     <description>Defines the maximum number of retries for IPC connections.</description>
>   </property>
> {noformat}
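
To give a feel for how a retry count of 50 can translate into wall-clock time, here is a rough back-of-the-envelope sketch. It is not Hadoop's or Tez's actual retry logic. Only max_retries=50 comes from the report above. The function name worst_case_seconds, the 20 s per-attempt timeout, and the 1 s sleep between retries are hypothetical placeholders chosen for illustration, and the model assumes one initial attempt followed by max_retries retries.

```python
# Rough model of how long one connection sequence can block a caller.
# Only max_retries=50 is taken from the report; the timeout and sleep
# values below are hypothetical placeholders, not values read from the cluster.


def worst_case_seconds(max_retries, attempt_timeout_s, retry_sleep_s):
    """Upper bound for one initial attempt plus max_retries retries.

    Every attempt is assumed to run into its full timeout, and a sleep
    is assumed between consecutive attempts (so max_retries sleeps).
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    attempts = 1 + max_retries
    return attempts * attempt_timeout_s + max_retries * retry_sleep_s


if __name__ == "__main__":
    total = worst_case_seconds(
        max_retries=50,          # from core-site.xml in the report
        attempt_timeout_s=20.0,  # assumption for illustration
        retry_sleep_s=1.0,       # assumption for illustration
    )
    print(f"{total:.0f} s (~{total / 60:.1f} min)")
```

If several retry layers wrap each other, the bounds multiply rather than add, so the effective values for the cluster should be checked before drawing conclusions from this estimate.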



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
