[ 
https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700228#comment-14700228
 ] 

Hitesh Shah edited comment on TEZ-2724 at 8/17/15 9:05 PM:
-----------------------------------------------------------

I think this is an edge case that occurs when RM HA is not enabled or RM
recovery is not enabled.

I think the switch to using TimelineClient should happen only under the
following condition: the RM either says the app has finished or throws an
AppNotFound exception (AppNotFound would imply that recovery is disabled). If
the RM is down, we should just wait, or throw an error if that is what is done
today. Switching to the TimelineClient while the RM is down is probably going
to be problematic, as the client will not switch back to the AM after the RM
comes back up (if recovery is enabled).
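
The condition above can be sketched as a small decision function. This is only an illustration, not Tez code: `RmResponse`, `StatusSource`, and `choose_status_source` are hypothetical names, and a real client would presumably derive the RM outcome from the application report or from the exception it catches.

```python
from enum import Enum, auto


class RmResponse(Enum):
    """Hypothetical outcomes of asking the RM for an application report."""
    APP_RUNNING = auto()
    APP_FINISHED = auto()
    APP_NOT_FOUND = auto()   # would imply that RM recovery is disabled
    RM_UNREACHABLE = auto()  # RM is down or restarting


class StatusSource(Enum):
    """Hypothetical places the client could read status from."""
    AM = auto()
    TIMELINE = auto()
    WAIT_FOR_RM = auto()


def choose_status_source(response: RmResponse) -> StatusSource:
    # Switch to the timeline only when the RM has given a definitive answer.
    if response in (RmResponse.APP_FINISHED, RmResponse.APP_NOT_FOUND):
        return StatusSource.TIMELINE
    # RM down: keep waiting (or raise, if that is the existing behavior)
    # so the client can still go back to the AM once the RM returns.
    if response is RmResponse.RM_UNREACHABLE:
        return StatusSource.WAIT_FOR_RM
    # App still running: keep talking to the AM.
    return StatusSource.AM
```

Keeping `RM_UNREACHABLE` separate from `APP_NOT_FOUND` is the point of the sketch: only a definitive answer from the RM moves the client off the AM.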



was (Author: hitesh):
I think this is an edge case that occurs when RM HA is not enabled or RM
recovery is not enabled.

I think the switch to using TimelineClient should happen only under the
following condition: the RM either says the app has finished or throws an
AppNotFound exception. If the RM is down, we should just wait, or throw an
error if that is what is done today. Switching to the TimelineClient while the
RM is down is probably going to be problematic, as the client will not switch
back to the AM after the RM comes back up (if recovery is enabled).


> Tez Client keeps on showing old status when application is finished but RM is 
> shutdown
> --------------------------------------------------------------------------------------
>
>                 Key: TEZ-2724
>                 URL: https://issues.apache.org/jira/browse/TEZ-2724
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.4
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2724-1.patch, amrecovery_mutlipleamrestart.txt
>
>
> From the logs, it seems the IPC retry interval is set to 20 seconds and the
> IPC max retries to 45. This means that the client will keep retrying the RPC
> connection for a total of 900 (20*45) seconds. During this period, the
> application may already have completed and an RM restart may have been
> triggered, as described in the JIRA description. I also think RM recovery is
> not enabled, so even after the new RM starts, the original application info
> is lost. That means the client can never get the correct application report,
> which makes it show the old status forever.
> {code}
> 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: 
> maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
> Deleted /user/hadoopqa/Input1
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls 
> /user/hadoopqa/Input2
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs  -rm -r 
> -skipTrash /user/hadoopqa/Input2
> 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: 
> maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
> {code}
> Configuration to reproduce this issue:
> * disable generic application history 
> (yarn.timeline-service.generic-application-history.enabled)
> * disable RM recovery (yarn.resourcemanager.recovery.enabled)
> * increase the IPC retry interval and max retries 
> (ipc.client.connect.retry.interval & ipc.client.connect.max.retries)
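
The 900-second retry window mentioned in the description can be checked against the log excerpt. The sketch below is only illustrative: `retry_window_seconds` and the regular expression are not part of Hadoop or Tez, and it assumes the `yy/MM/dd HH:mm:ss` timestamp layout seen in the excerpt, with the wrapped log lines rejoined.

```python
import re
from datetime import datetime

# The two retry lines from the excerpt, with line wrapping removed.
LOG = """\
15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
"""

RETRY_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO ipc\.Client: Retrying connect"
    r".*Already tried (\d+) time\(s\); maxRetries=(\d+)"
)


def retry_window_seconds(log: str) -> float:
    """Estimate the total retry window from the first and last retry lines."""
    entries = []
    for line in log.splitlines():
        match = RETRY_LINE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            entries.append((stamp, int(match.group(2)), int(match.group(3))))
    (t0, n0, max_retries), (t1, n1, _) = entries[0], entries[-1]
    # Seconds per retry, averaged over the attempts between the two lines.
    interval = (t1 - t0).total_seconds() / (n1 - n0)
    return interval * max_retries


print(retry_window_seconds(LOG))
```

With the two lines above, the measured interval is 20 seconds per attempt, which matches the 20*45 estimate in the description.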



