[ https://issues.apache.org/jira/browse/TEZ-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2738:
----------------------------------
    Description: 
Env: Ran a job with Tez (built from the master branch on Aug 24).

One of the nodes went down in the middle of the run, and DAGAppMaster had a
container launch on that node. After some time, this node was declared
unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was
unresponsive for > 1.5 hours after DAG cleanup. It kept trying to connect
to the unhealthy node. I will attach the logs to this JIRA.

ipc.client.connect.max.retries has been set to 50 in core-site.xml

{noformat}
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>50</value>
  <description>Defines the maximum number of retries for IPC connections.</description>
</property>
{noformat}


  was:
Env: Ran a job with Tez (built from the master branch on Aug 24).

One of the nodes went down in the middle of the run, and DAGAppMaster had a
container launch on that node. After some time, this node was declared
unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was
unresponsive for > 1.5 hours after DAG cleanup. It kept trying to connect
to the unhealthy node. I will attach the logs to this JIRA.


> ContainerLauncher tries to connect to unhealthy node for large number of times
> ------------------------------------------------------------------------------
>
>                 Key: TEZ-2738
>                 URL: https://issues.apache.org/jira/browse/TEZ-2738
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>
> Env: Ran a job with Tez (built from the master branch on Aug 24).
> One of the nodes went down in the middle of the run, and DAGAppMaster had a
> container launch on that node. After some time, this node was declared
> unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was
> unresponsive for > 1.5 hours after DAG cleanup. It kept trying to connect
> to the unhealthy node. I will attach the logs to this JIRA.
> ipc.client.connect.max.retries has been set to 50 in core-site.xml
> {noformat}
> <property>
>   <name>ipc.client.connect.max.retries</name>
>   <value>50</value>
>   <description>Defines the maximum number of retries for IPC connections.</description>
> </property>
> {noformat}
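
As a rough illustration of how a retry count of 50 can turn into a long stall, the sketch below multiplies the retry count by an assumed cost per attempt. Only the value 50 comes from this report; the 20-second connect timeout, the 1-second pause between retries, and the helper names are assumptions made for the example, not values taken from the attached logs or from the Tez code.

```python
import math


def connect_loop_seconds(max_retries: int, attempt_timeout_s: float, retry_pause_s: float) -> float:
    """Rough upper bound on one connect loop: every retry waits for the
    timeout and then pauses before the next attempt."""
    return max_retries * (attempt_timeout_s + retry_pause_s)


def loops_to_exceed(total_stall_s: float, loop_s: float) -> int:
    """Number of full connect loops needed to reach the given stall time."""
    return math.ceil(total_stall_s / loop_s)


# 50 is the configured ipc.client.connect.max.retries from this report.
# The 20 s timeout and 1 s pause are assumed values for illustration only.
one_loop = connect_loop_seconds(max_retries=50, attempt_timeout_s=20.0, retry_pause_s=1.0)
print(one_loop / 60)                          # 17.5 (minutes per connect loop)
print(loops_to_exceed(1.5 * 3600, one_loop))  # 6
```

Under these assumptions, a handful of back-to-back connect loops against the dead node would be enough to account for a stall of more than 1.5 hours.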



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
