[jira] [Updated] (TEZ-3405) Support ability for AM to kill itself if there is no client heartbeating to it

Hitesh Shah (JIRA) Thu, 01 Sep 2016 14:59:53 -0700

     [ https://issues.apache.org/jira/browse/TEZ-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hitesh Shah updated TEZ-3405:
-----------------------------
    Attachment: TEZ-3405.5.patch

Patch 5 uploaded. 

bq. Nit: TEZ_AM_CLIENT_HEARTBEAT_TIMEOUT_SECS_MINIMUM - Move to TezConstants?

Done.

bq. Nit: Move logic from TezConfiguration to a separate helper class? 
TezConfiguration.java does typically serve as documentation as well.

Moved to TezCommonUtils, which is tagged private. Also moved the helper 
function introduced in TEZ-3326 from TezUtils to this class (as TezUtils is 
public). \cc [~ebadger]

bq. Nit: Replace new TimerTask with new Runnable (TimerTask is not serving any 
purpose)

Done. 
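
For reference, a minimal sketch of the Runnable point, assuming the periodic check is handed to a scheduler that accepts any Runnable (java.util.concurrent.ScheduledExecutorService here). The class and method names are hypothetical and are not taken from the patch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; class and method names are not from the patch.
class PeriodicCheckSketch {

  // Schedules a plain Runnable at a fixed rate and waits (up to 5 seconds)
  // until it has run `runs` times. Returns the number of runs observed.
  static int runChecks(int runs, long periodMillis) {
    AtomicInteger count = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(runs);
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // A lambda (or anonymous Runnable) is enough; no TimerTask subclass is needed.
    Runnable check = () -> {
      count.incrementAndGet();
      done.countDown();
    };

    scheduler.scheduleAtFixedRate(check, 0, periodMillis, TimeUnit.MILLISECONDS);
    try {
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
    } finally {
      scheduler.shutdownNow(); // stop the periodic task
    }
    return count.get();
  }
}
```

TimerTask implements Runnable, so wrapping the check in a TimerTask compiles, but it adds nothing unless the task is actually run by a java.util.Timer.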

bq. Question: The initial timer is only set up after the AM has recovered, 
correct?

Yes. The timer is set up in DAGAppMaster::serviceStart, after all services 
have started, the heavy lifting of recovery is done, and the DAG has just 
about started running.
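
A rough sketch of the timeout bookkeeping described above, not the code in the patch. The class name, the minimum value of 10 seconds, and the clamping behaviour are all assumptions made for illustration. The clock starts when the monitor is constructed, which stands in for arming the timer only once startup and recovery are done.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; names and values are not from the patch.
class ClientHeartbeatMonitorSketch {

  // Assumed floor for the timeout; the real minimum in TezConstants may differ.
  static final int TIMEOUT_SECS_MINIMUM = 10;

  private final long timeoutMillis;
  private final AtomicLong lastHeartbeatMillis;

  // Construct this only after services have started and recovery is complete,
  // so that time spent in recovery does not count against the client.
  ClientHeartbeatMonitorSketch(int configuredTimeoutSecs, long nowMillis) {
    // Assumption: values below the minimum are raised to the minimum.
    int effectiveSecs = Math.max(configuredTimeoutSecs, TIMEOUT_SECS_MINIMUM);
    this.timeoutMillis = effectiveSecs * 1000L;
    this.lastHeartbeatMillis = new AtomicLong(nowMillis);
  }

  // Called whenever the client contacts the AM.
  void recordHeartbeat(long nowMillis) {
    lastHeartbeatMillis.set(nowMillis);
  }

  // Called from the periodic check; true means the AM should shut itself down.
  boolean hasTimedOut(long nowMillis) {
    return nowMillis - lastHeartbeatMillis.get() > timeoutMillis;
  }
}
```

Passing the current time in explicitly keeps the check deterministic; a periodic task would call hasTimedOut(System.currentTimeMillis()) and trigger shutdown when it returns true.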
 


> Support ability for AM to kill itself if there is no client heartbeating to it
> ------------------------------------------------------------------------------
>
>                 Key: TEZ-3405
>                 URL: https://issues.apache.org/jira/browse/TEZ-3405
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Gunther Hagleitner
>            Assignee: Hitesh Shah
>            Priority: Critical
>         Attachments: TEZ-3405.1.patch, TEZ-3405.2.patch, TEZ-3405.3.patch, 
> TEZ-3405.4.patch, TEZ-3405.5.patch
>
>
> HiveServer2 optionally maintains a pool of AMs in either Tez or LLAP mode. 
> This is done to amortize the cost of launching a Tez session.
> We also try in a shutdown hook to kill all these AMs when HS2 goes down. 
> However, there are cases where HS2 doesn't get the chance to kill these AMs 
> before it goes away. As a result these zombie AMs hang around until the 
> timeout kicks in.
> The trouble with the timeout is that we have to set it fairly high. Otherwise 
> the benefit of having pre-launched AMs obviously goes away (in a lightly 
> loaded cluster).
> So, if people kill/restart HS2, they often run into situations where the 
> cluster/queue doesn't have any more capacity for AMs. They either have to 
> manually kill the zombies or wait.
> The request is therefore for Tez to maintain a heartbeat to the client. If 
> the client goes away the AM should exit. That way we can keep the AMs alive 
> for a long time regardless of activity and at the same time don't have to 
> worry about them if HS2 goes down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
