[ 
https://issues.apache.org/jira/browse/TEZ-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146898#comment-14146898
 ] 

Bikas Saha commented on TEZ-1620:
---------------------------------

There are 2 related issues here:
1) The AM sleeps for 5 seconds before exiting. This is generally a waste of 
time. I also think this sleep is currently what makes the local mode unit 
tests work, because the test exits while the AM is still sleeping. If the test 
has not exited by the time the AM sleep is over, the AM's System.exit() will 
bring the test down.
2) This sleep can cause a race in the minicluster: the cluster can shut down 
before the AM exits, which causes the YARN rmClient in the AM scheduler to wait 
for the mini cluster RM to come back up (for RM HA). This leaves orphaned 
DAGAppMaster processes.

For 1), the sleep is there to prevent the AM from exiting before the client can 
poll the AM for success status. One solution is for the AM to remember whether 
it has already given the client a success status and, if so, not sleep. Another 
is for TezClient.stop() to send a shutdown signal to the AM that would 
interrupt the sleep. This would, however, break the local mode tests, as the 
System.exit() would kick in. We can double check this and look at fixing the 
local mode AM to not call System.exit().
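
To make the two options for 1) concrete, here is a rough sketch of a linger 
that can be cut short. ShutdownLinger and its methods are hypothetical names, 
not existing Tez code. The assumption is that release() would be called either 
when the AM has already reported a final status to the client, or when a 
shutdown signal arrives from TezClient.stop().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not part of Tez. */
class ShutdownLinger {
    private final CountDownLatch released = new CountDownLatch(1);

    /**
     * Assumed to be called once the client has seen the final status,
     * or once the client has asked the AM to shut down.
     */
    void release() {
        released.countDown();
    }

    /**
     * Waits up to maxWaitMillis instead of sleeping unconditionally.
     * Returns true if the wait was cut short by release(), false if the
     * full timeout elapsed or the waiting thread was interrupted.
     */
    boolean await(long maxWaitMillis) {
        try {
            return released.await(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt for callers further up the stack.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With something like this the AM would still wait the full period when nobody 
has polled it, so the existing behavior is kept as the fallback.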

For just the minicluster case, we could change MiniTezCluster.stop() to kill 
all outstanding applications and then wait for the running apps to drain before 
stopping.
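
The kill-then-drain idea could look roughly like the sketch below. RunningApp 
and killAndDrain are hypothetical stand-ins, not YARN or Tez APIs; a real 
change would have to get the application list and the kill call from the mini 
cluster's RM. The timeout is an assumption so that stop() cannot block forever 
on an app that never finishes.

```java
import java.util.List;

/** Hypothetical sketch for illustration only; not part of Tez or YARN. */
class DrainBeforeStop {

    /** Assumed minimal view of an application known to the mini cluster. */
    interface RunningApp {
        void kill();
        boolean isFinished();
    }

    /**
     * Kills every app, then polls until all of them report finished.
     * Returns true if everything drained, false on timeout or interrupt.
     */
    static boolean killAndDrain(List<RunningApp> apps, long timeoutMillis, long pollMillis) {
        for (RunningApp app : apps) {
            app.kill();
        }
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            boolean allFinished = true;
            for (RunningApp app : apps) {
                if (!app.isFinished()) {
                    allFinished = false;
                    break;
                }
            }
            if (allFinished) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up rather than hang the test run
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The cluster would only be stopped after killAndDrain returns, so an AM that is 
still unregistering has an RM to talk to.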

Any other ideas or potential solutions?

> Wait for application finish before stopping MiniTezCluster
> ----------------------------------------------------------
>
>                 Key: TEZ-1620
>                 URL: https://issues.apache.org/jira/browse/TEZ-1620
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>
> Currently, we sleep 10 seconds to wait for DAGAppMaster to finish; otherwise 
> DAGAppMaster will hang while trying to connect to the RM to unregister. 
> We should wait for all the applications to finish before stopping 
> MiniTezCluster. 



