[ 
https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2724:
----------------------------
    Description: 
From the logs, it seems the IPC retry interval is set to 20 seconds and the IPC 
max retries to 45. This means the client will keep retrying the RPC connection 
for a total of 900 (20*45) seconds. During this period the application may 
already have completed and an RM restart may have been triggered, as described 
in the JIRA description. I think RM recovery is not enabled, so even after the 
new RM has started, the original application info is lost. The client can 
therefore never get the correct application report, which makes it show the old 
status forever. 
{code}
15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: 
maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
Deleted /user/hadoopqa/Input1

RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls 
/user/hadoopqa/Input2

RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs  -rm -r 
-skipTrash /user/hadoopqa/Input2

15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: 
maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
{code}
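
The 900-second estimate in the description can be checked with a small 
calculation. The sketch below is illustrative only: RetryWindow is a 
hypothetical helper, not a Hadoop or Tez class, and it assumes the retry 
interval is expressed in milliseconds. It also ignores any per-attempt connect 
timeout, so the real window may be somewhat longer.

```java
import java.time.Duration;

// Sketch only: RetryWindow is an illustrative helper, not part of Hadoop or Tez.
final class RetryWindow {
    /** Worst-case time spent waiting between attempts: one interval per retry. */
    static Duration total(long retryIntervalMs, int maxRetries) {
        return Duration.ofMillis(Math.multiplyExact(retryIntervalMs, (long) maxRetries));
    }

    public static void main(String[] args) {
        // Values read off the log: attempts are 20 s apart, maxRetries=45.
        Duration window = total(20_000L, 45);
        System.out.println(window.getSeconds() + " s"); // prints "900 s"
    }
}
```

The two log timestamps (19:13:43 and 19:14:03) are 20 seconds apart, which is 
consistent with the interval used here.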


Configuration to reproduce this issue:
* disable generic application history 
(yarn.timeline-service.generic-application-history.enabled)
* disable RM recovery (yarn.resourcemanager.recovery.enabled)
* increase the IPC retry interval and max retries 
(ipc.client.connect.retry.interval & ipc.client.connect.max.retries)
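
As a rough illustration, the reproduction settings could be collected as 
key/value pairs like this. The property names come from the list above; the 
values are assumptions based on the log (20 s interval, 45 retries), with the 
interval written in milliseconds on the assumption that this is the unit Hadoop 
expects. ReproConfig is a hypothetical holder, not a Hadoop or Tez class. On a 
real cluster these would normally be set in yarn-site.xml and core-site.xml.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: ReproConfig is an illustrative holder, not a Hadoop or Tez class.
final class ReproConfig {
    static Map<String, String> settings() {
        Map<String, String> conf = new LinkedHashMap<>();
        // YARN side: no generic application history and no RM state recovery.
        conf.put("yarn.timeline-service.generic-application-history.enabled", "false");
        conf.put("yarn.resourcemanager.recovery.enabled", "false");
        // Client side: assumed values matching the log (interval in ms, 45 retries).
        conf.put("ipc.client.connect.retry.interval", "20000");
        conf.put("ipc.client.connect.max.retries", "45");
        return conf;
    }

    public static void main(String[] args) {
        // Print the settings in key=value form, in insertion order.
        settings().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```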

  was:
From the logs, it seems the IPC retry interval is set to 20 seconds and the IPC 
max retries to 45. This means the client will keep retrying the RPC connection 
for a total of 900 (20*45) seconds. During this period the application may 
already have completed and an RM restart may have been triggered, as described 
in the JIRA description. I think RM recovery is not enabled, so even after the 
new RM has started, the original application info is lost. The client can 
therefore never get the correct application report, which makes it show the old 
status forever. 
Karam Singh, can you confirm that RM recovery is not enabled in the system 
test? (yarn.resourcemanager.recovery.enabled is false by default.) If so, 
this can be fixed by checking ATS in 
DAGClientImpl#checkAndSetDagCompletionStatus.
{code}
15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: 
maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
Deleted /user/hadoopqa/Input1

RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls 
/user/hadoopqa/Input2

RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs  -rm -r 
-skipTrash /user/hadoopqa/Input2

15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: 
maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
{code}


> Tez Client keeps on showing old status when application is finished but RM is 
> shutdown
> --------------------------------------------------------------------------------------
>
>                 Key: TEZ-2724
>                 URL: https://issues.apache.org/jira/browse/TEZ-2724
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.4
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: amrecovery_mutlipleamrestart.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
