Gunther Hagleitner created TEZ-3405:
---------------------------------------

             Summary: Support ability for AM to kill itself if there is no client heartbeating to it
                 Key: TEZ-3405
                 URL: https://issues.apache.org/jira/browse/TEZ-3405
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Gunther Hagleitner
            Priority: Critical


HiveServer2 optionally maintains a pool of AMs in either Tez or LLAP mode. This 
is done to amortize the cost of launching a Tez session.

We also try to kill all these AMs in a shutdown hook when HS2 goes down. 
However, there are cases where HS2 doesn't get the chance to run that hook 
before it goes away. As a result, these zombie AMs hang around until the 
timeout kicks in.

The trouble with the timeout is that we have to set it fairly high. Otherwise, 
in a lightly loaded cluster, the AMs would expire between queries and the 
benefit of having pre-launched AMs would go away.

So, if people kill/restart HS2, they often run into situations where the 
cluster/queue doesn't have any more capacity for AMs. They either have to 
manually kill the zombies or wait.

The request is therefore for Tez to track a heartbeat from the client. If the 
client stops heartbeating, the AM should exit. That way we can keep the AMs 
alive for a long time regardless of activity and, at the same time, not have to 
worry about them if HS2 goes down.
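
To make the request concrete, here is a minimal sketch in Java of what such a 
liveness check could look like. It is not Tez code: the class 
ClientHeartbeatMonitor, its methods, and the injected clock and shutdown 
callback are hypothetical names chosen for illustration, and how the heartbeat 
actually reaches the AM is left open.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch (not part of the Tez API): tracks the last time a
 * client heartbeat was seen and triggers a shutdown callback once the
 * client has been silent for longer than the configured timeout.
 */
final class ClientHeartbeatMonitor {
    private final long timeoutMillis;
    private final LongSupplier clock;        // injected so the logic is testable
    private final Runnable onClientLost;     // e.g. ask the AM to shut down
    private final AtomicLong lastHeartbeatMillis;
    private final AtomicBoolean fired = new AtomicBoolean(false);

    ClientHeartbeatMonitor(long timeoutMillis, LongSupplier clock, Runnable onClientLost) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.onClientLost = onClientLost;
        // Treat creation time as the first heartbeat so a fresh AM is not killed at once.
        this.lastHeartbeatMillis = new AtomicLong(clock.getAsLong());
    }

    /** Called whenever the client (for example HS2) pings the AM. */
    void recordHeartbeat() {
        lastHeartbeatMillis.set(clock.getAsLong());
    }

    /**
     * Called periodically. Returns true if the client is considered gone.
     * The shutdown callback runs at most once.
     */
    boolean check() {
        boolean expired = clock.getAsLong() - lastHeartbeatMillis.get() > timeoutMillis;
        if (expired && fired.compareAndSet(false, true)) {
            onClientLost.run();
        }
        return expired;
    }
}
```

In a real AM, check() could be driven by a 
ScheduledExecutorService.scheduleAtFixedRate task with System::currentTimeMillis 
as the clock. The point of the design is that the heartbeat timeout can be 
short (seconds to minutes) and independent of the idle timeout, which can then 
stay high.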



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
