[
https://issues.apache.org/jira/browse/TEZ-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146898#comment-14146898
]
Bikas Saha commented on TEZ-1620:
---------------------------------
There are 2 related issues here:
1) The AM sleeps for 5 secs before exiting. This is generally a waste of
time. I also think this sleep is what currently makes the local mode unit
tests work, because the test exits while the AM is still sleeping. If the test
has not exited by the time the AM sleep is over, the AM's System.exit() will
bring the test down.
2) This sleep can cause a race in the minicluster: the cluster can shut down
before the AM exits, which causes the YARN rmClient in the AM scheduler to wait
for the mini cluster RM to come back up (for RM HA). This leaves orphaned
DAGAppMaster processes.
For 1), the sleep is there to prevent the AM from exiting before the client
can poll the AM for the success status. One solution is for the AM to remember
whether it has already given the client a success status and, if so, not sleep.
Another is for TezClient.stop() to send a shutdown signal to the AM that
interrupts the sleep. This would, however, break the local mode tests, as the
System.exit() would kick in. We can double check this and look at fixing the
local mode AM to not call System.exit().
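To make the two options for 1) concrete, here is a minimal sketch of a sleep
that can be cut short. It is not Tez code: ExitDelay, release() and
awaitBeforeExit() are hypothetical names, and the sketch only assumes that
something in the AM can call release() once the client has received the final
status or has sent a shutdown signal.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Tez: a pre-exit wait that can be released early. */
final class ExitDelay {
    private final CountDownLatch released = new CountDownLatch(1);
    private final long maxWaitMillis;

    ExitDelay(long maxWaitMillis) {
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Call once the client has seen the final status, or has asked the AM to shut down. */
    void release() {
        released.countDown();
    }

    /**
     * Waits up to maxWaitMillis before the caller exits.
     * Returns true if release() was called (before or during the wait),
     * false if the wait timed out or the thread was interrupted.
     */
    boolean awaitBeforeExit() {
        try {
            return released.await(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        ExitDelay delay = new ExitDelay(5_000);
        delay.release(); // client already has the success status
        long start = System.nanoTime();
        boolean releasedEarly = delay.awaitBeforeExit();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("releasedEarly=" + releasedEarly + " elapsedMs=" + elapsedMs);
    }
}
```

With something like this, the full wait is only paid when the client never
polls. It does not by itself address the System.exit() problem in local mode.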
For just the minicluster case, we could change MiniTezCluster.stop() to kill
all outstanding applications and then wait for the running apps to drain
before stopping.
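The kill-then-drain idea could look roughly like the sketch below. This is an
illustration only: the App interface, killAndDrain() and FakeApp are
hypothetical stand-ins for whatever the mini cluster uses to track and kill
applications, and the timeout and poll interval are assumed parameters.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not part of Tez or YARN: drain applications before stopping. */
final class DrainBeforeStop {

    /** Stand-in for a running application as seen by the mini cluster. */
    interface App {
        boolean isFinished();
        void kill();
    }

    /**
     * Kills every unfinished app, then polls until all apps report finished
     * or the timeout elapses. Returns true if everything drained in time.
     */
    static boolean killAndDrain(List<? extends App> apps, long timeoutMillis, long pollMillis) {
        for (App app : apps) {
            if (!app.isFinished()) {
                app.kill();
            }
        }
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            boolean allDone = true;
            for (App app : apps) {
                if (!app.isFinished()) {
                    allDone = false;
                    break;
                }
            }
            if (allDone) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up; the caller decides whether to stop anyway
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Toy app: reports finished on the Nth isFinished() call after kill() (at once if N is 0). */
    static final class FakeApp implements App {
        private int pollsUntilDone;
        private boolean killed;

        FakeApp(int pollsUntilDone) {
            this.pollsUntilDone = pollsUntilDone;
        }

        @Override
        public boolean isFinished() {
            if (!killed) {
                return false;
            }
            if (pollsUntilDone > 0) {
                pollsUntilDone--;
            }
            return pollsUntilDone == 0;
        }

        @Override
        public void kill() {
            killed = true;
        }
    }

    public static void main(String[] args) {
        List<FakeApp> apps = Arrays.asList(new FakeApp(0), new FakeApp(3));
        System.out.println("drained=" + killAndDrain(apps, 2_000, 10));
    }
}
```

The timeout is there so that a stuck application cannot block the cluster
shutdown forever.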
Any other ideas or potential solutions?
> Wait for application finish before stopping MiniTezCluster
> ----------------------------------------------------------
>
> Key: TEZ-1620
> URL: https://issues.apache.org/jira/browse/TEZ-1620
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
>
> Currently, we sleep 10 seconds to wait for DAGAppMaster to finish; otherwise
> DAGAppMaster will hang trying to connect to the RM to unregister.
> We should wait for all the applications to finish before stopping
> MiniTezCluster.