[ 
https://issues.apache.org/jira/browse/TEZ-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121134#comment-14121134
 ] 

Siddharth Seth commented on TEZ-1541:
-------------------------------------

One possible (brute-force) fix would be to halt the AM after a specific 
timeout, in case the regular shutdown does not work.
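
A minimal sketch of what such a timeout could look like, using only the JDK. 
ShutdownWatchdog and its methods are hypothetical names and not existing Tez 
classes; the timeout value and the point where the AM would start the watchdog 
are assumptions as well.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Tez): acts if shutdown does not finish in time. */
final class ShutdownWatchdog {
  private final CountDownLatch shutdownDone = new CountDownLatch(1);
  private final Thread watcher;

  ShutdownWatchdog(long timeoutMillis, Runnable onTimeout) {
    watcher = new Thread(() -> {
      try {
        // await() returns false when the timeout elapses before
        // markShutdownComplete() has been called.
        if (!shutdownDone.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
          onTimeout.run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "shutdown-watchdog");
    // Daemon thread: the watchdog itself never keeps the JVM alive.
    watcher.setDaemon(true);
  }

  /** Variant that halts the JVM; halt() skips shutdown hooks, which could hang too. */
  static ShutdownWatchdog halting(long timeoutMillis) {
    return new ShutdownWatchdog(timeoutMillis, () -> Runtime.getRuntime().halt(1));
  }

  /** Call just before the regular shutdown sequence begins. */
  void start() {
    watcher.start();
  }

  /** Call once the regular shutdown sequence has completed. */
  void markShutdownComplete() {
    shutdownDone.countDown();
  }

  /** Waits up to the given time for the watcher thread; true if it has finished. */
  boolean awaitWatcher(long millis) {
    try {
      watcher.join(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !watcher.isAlive();
  }
}
```

The AM would call start() before stopping its services and 
markShutdownComplete() once they have all stopped. If a service blocks for 
longer than the timeout, the halting variant ends the process without running 
shutdown hooks.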

Alternatively, the unregister call could be placed into its own thread (which 
would unblock other services from shutting down). This has the drawback of 
failing to unregister an app cleanly on a regular cluster though.
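
A rough sketch of the separate-thread idea, again with plain JDK classes. 
BoundedUnregister is a hypothetical name, and the Runnable passed in stands in 
for whatever call the AM actually makes to unregister from the RM; neither is 
taken from the Tez code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Tez): runs unregister off the shutdown thread. */
final class BoundedUnregister {
  private BoundedUnregister() {}

  /** Returns true only if the unregister call finished cleanly within waitMillis. */
  static boolean run(Runnable unregister, long waitMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "rm-unregister");
      t.setDaemon(true); // a stuck call must not keep the JVM alive
      return t;
    });
    Future<?> result = executor.submit(unregister);
    try {
      result.get(waitMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      // Give up waiting so other services can go on shutting down.
      result.cancel(true);
      return false;
    } catch (ExecutionException e) {
      // The unregister call itself failed.
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      executor.shutdownNow();
    }
  }
}
```

A false return is the drawback mentioned above: the caller moves on, but the 
app may not have been unregistered cleanly.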

The 5-second delay on unregistration makes this problem worse, especially when 
the MiniTezCluster is involved, since that provides a big window for such a 
condition to occur (RM going down while Tez is still trying to communicate 
with it). Shutting down the scheduler early (i.e. unregistering the app from 
the RM early) can reduce the size of the window, but doesn't get rid of the 
problem completely.

For the MiniCluster case, we could also have it actively kill all running 
applications while shutting down.

> DAGAppMaster can get stuck on shutdown if the RM is no longer around
> --------------------------------------------------------------------
>
>                 Key: TEZ-1541
>                 URL: https://issues.apache.org/jira/browse/TEZ-1541
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>         Attachments: dagapp.threads.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
