[
https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736453#comment-14736453
]
Jeff Zhang commented on TEZ-2724:
---------------------------------
Steps to reproduce this issue:
* configuration requirements:
** yarn.timeline-service.generic-application-history.enabled=false
** yarn.resourcemanager.recovery.enabled=false
** ipc.client.connect.retry.interval=5000
** ipc.client.connect.max.retries=12
* Run command: "hadoop jar tez-tests/target/tez-tests-0.8.1-SNAPSHOT.jar mrrsleep -m 5 -r 5 -mt 20000 -rt 10000"
* Kill the AM while the job is running
* Watch the RM UI until the YARN application finishes, then restart the RM
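
As a rough check on the two ipc settings listed above, the sketch below estimates how long a client keeps retrying a connection. It assumes the interval is in milliseconds and that the total window is simply interval times max retries, which ignores per-attempt connect timeouts. The function name retry_window_seconds is made up for illustration and is not part of Hadoop or Tez.

```python
def retry_window_seconds(retry_interval_ms: int, max_retries: int) -> float:
    """Approximate total retry time: interval * retries, converted to seconds.

    Assumption: per-attempt connect timeouts are ignored.
    """
    return retry_interval_ms * max_retries / 1000.0


# Values taken from the configuration requirements above:
# ipc.client.connect.retry.interval=5000, ipc.client.connect.max.retries=12
print(retry_window_seconds(5000, 12))  # 60.0
```

Under that approximation, the settings above give a retry window of about one minute.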
> Tez Client keeps on showing old status when application is finished but RM is
> shutdown
> --------------------------------------------------------------------------------------
>
> Key: TEZ-2724
> URL: https://issues.apache.org/jira/browse/TEZ-2724
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.4
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2724-1.patch, amrecovery_mutlipleamrestart.txt
>
>
> From the logs, it seems the IPC retry interval is set to 20 seconds and the
> IPC max retries to 45, so the client keeps retrying the RPC connection for a
> total of 900 (20*45) seconds. During this period the application may
> already have completed, and the RM restart may have been triggered, as said
> in the JIRA description. I think RM recovery is not enabled, so even after
> the new RM has started, the original application info is lost. The client
> can therefore never get the correct application report, which makes it show
> the old status forever.
> {code}
> 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server:
> maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
> Deleted /user/hadoopqa/Input1
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls
> /user/hadoopqa/Input2
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -rm -r
> -skipTrash /user/hadoopqa/Input2
> 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server:
> maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
> {code}
> Configuration to reproduce this issue
> * disable generic application history
> (yarn.timeline-service.generic-application-history.enabled)
> * disable RM recovery (yarn.resourcemanager.recovery.enabled)
> * increase the IPC retry interval and max retries
> (ipc.client.connect.retry.interval & ipc.client.connect.max.retries)
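
The reasoning in the description can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyRM, client_view and simulate are hypothetical names, not YARN or Tez classes, and the client is assumed to keep displaying its last known status whenever the RM has no report for the application. Generic application history is assumed to be disabled, so there is no other source for the final status.

```python
class ToyRM:
    """Toy stand-in for a ResourceManager; NOT the YARN API."""

    def __init__(self, recovery_enabled: bool):
        self.recovery_enabled = recovery_enabled
        self.apps = {}

    def restart(self):
        # Assumption: without recovery, a restarted RM forgets every application.
        if not self.recovery_enabled:
            self.apps = {}

    def get_status(self, app_id):
        # None stands for "unknown application".
        return self.apps.get(app_id)


def client_view(rm, app_id, last_known):
    """Status a client would display: the RM's report, else the last known one."""
    status = rm.get_status(app_id)
    return status if status is not None else last_known


def simulate(recovery_enabled: bool):
    rm = ToyRM(recovery_enabled)
    rm.apps["app_1"] = "RUNNING"
    shown = client_view(rm, "app_1", None)  # client sees RUNNING
    rm.apps["app_1"] = "FINISHED"           # app completes while the client is busy retrying
    rm.restart()                            # RM restarts before the client polls again
    return client_view(rm, "app_1", shown)


print(simulate(recovery_enabled=False))  # RUNNING
print(simulate(recovery_enabled=True))   # FINISHED
```

In this model the stale RUNNING status only persists when recovery is disabled, which matches the explanation given in the description.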