[
https://issues.apache.org/jira/browse/TEZ-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121134#comment-14121134
]
Siddharth Seth commented on TEZ-1541:
-------------------------------------
One possible (brute-force) fix would be to halt the AM after a specific
timeout, in case the regular shutdown does not work.
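
A minimal sketch of such a timeout-based halt, for illustration only. The
class ShutdownWatchdog and its methods are hypothetical names, not existing
Tez code, and the timeout value is left to the caller. The action taken on
timeout is passed in as a Runnable so the sketch can be exercised without
killing the JVM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Tez): forces the process down if a clean shutdown stalls. */
public class ShutdownWatchdog {
    private final CountDownLatch shutdownDone = new CountDownLatch(1);
    private final long timeoutMillis;
    private final Runnable onTimeout;

    public ShutdownWatchdog(long timeoutMillis, Runnable onTimeout) {
        this.timeoutMillis = timeoutMillis;
        this.onTimeout = onTimeout;
    }

    /** Assumed production default: halt() skips shutdown hooks, so it cannot block on them. */
    public static ShutdownWatchdog haltingAfter(long timeoutMillis) {
        return new ShutdownWatchdog(timeoutMillis, () -> Runtime.getRuntime().halt(1));
    }

    /** Call just before the regular shutdown sequence starts. */
    public Thread start() {
        Thread watchdog = new Thread(() -> {
            try {
                // await() returns false if the timeout elapsed before shutdownCompleted().
                if (!shutdownDone.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    onTimeout.run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "shutdown-watchdog");
        // Daemon thread: the watchdog itself never keeps the JVM alive.
        watchdog.setDaemon(true);
        watchdog.start();
        return watchdog;
    }

    /** Call once the regular shutdown has finished; the watchdog then exits quietly. */
    public void shutdownCompleted() {
        shutdownDone.countDown();
    }
}
```

The intended use would be to call start() before stopping services and
shutdownCompleted() afterwards, so the halt only fires when the regular path
is stuck.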
Alternatively, the unregister call could be placed into its own thread (which
would unblock other services from shutting down). This has the drawback of
failing to unregister an app cleanly on a regular cluster though.
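
A rough sketch of the separate-thread idea, with one addition that is not in
the proposal above: the caller waits a bounded time for the unregister call
before moving on, which may soften the clean-unregister drawback on a regular
cluster. BoundedUnregister and unregisterWithTimeout are hypothetical names,
and the Runnable stands in for whatever call actually talks to the RM.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Tez): runs the unregister call off the shutdown thread. */
public class BoundedUnregister {

    /**
     * Runs unregisterCall on its own daemon thread and waits at most timeoutMillis.
     * Returns true if the call finished normally, false if it timed out or threw.
     */
    public static boolean unregisterWithTimeout(Runnable unregisterCall, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "rm-unregister");
            // Daemon thread: a call stuck on a dead RM cannot keep the JVM alive.
            t.setDaemon(true);
            return t;
        });
        Future<?> result = executor.submit(unregisterCall);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Give up waiting so other services can continue shutting down.
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The unregister call itself failed.
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A false return here means the app may not have been unregistered cleanly, so
the caller would still need to decide how to report that.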
The 5-second delay on unregistration makes this problem worse, especially when
the MiniTezCluster is involved, since that provides a big window for such a
condition to occur (RM going down while Tez is still trying to communicate
with it). Shutting down the scheduler early (that is, unregistering the app
from the RM early) can reduce the size of the window, but doesn't get rid of
the problem completely.
For the MiniCluster case, we could also have it actively kill all running
applications while shutting down.
> DAGAppMaster can get stuck on shutdown if the RM is no longer around
> --------------------------------------------------------------------
>
> Key: TEZ-1541
> URL: https://issues.apache.org/jira/browse/TEZ-1541
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Siddharth Seth
> Attachments: dagapp.threads.txt
>
>