[ 
https://issues.apache.org/jira/browse/SAMZA-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prateek Maheshwari resolved SAMZA-871.
--------------------------------------
    Resolution: Fixed

> Implement heart-beat mechanism between JobCoordinator and all running 
> containers
> --------------------------------------------------------------------------------
>
>                 Key: SAMZA-871
>                 URL: https://issues.apache.org/jira/browse/SAMZA-871
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Yi Pan (Data Infrastructure)
>            Assignee: Abhishek Shivanna
>             Fix For: 0.13.0
>
>
> Right now, Samza relies on YARN to detect whether a container is alive or 
> not. This has a few problems:
> 1) With the effort to build standalone Samza (SAMZA-516) and to make Samza
> more pluggable with other distributed cluster management systems (like
> Mesos and Kubernetes), we need to make container liveness detection
> independent of YARN.
> 2) YARN-based liveness detection has also caused problems with leaked
> containers when the NM crashes. It creates a dilemma:
> ## If the NM can be restarted quickly, we would like to keep the container
> alive and unaffected by the NM going down, since that saves ongoing work
> (i.e., with yarn.nodemanager.recovery.enabled=true).
> ## However, when the RM loses the heartbeat from the NM and determines that
> the container is "dead", we must make sure the container is actually killed
> to avoid launching duplicate containers, since the AM has no other way to
> know whether the container is still alive.
> If we implement a direct heartbeat mechanism between the Samza
> JobCoordinator and each SamzaContainer, we can be agnostic to whatever the
> YARN RM/NM/AM sync status is.
> Possible approaches:
> 1) Use the JobCoordinator HTTP port for heartbeats. Pros: simple,
> synchronous communication. Cons: could become a bottleneck in a job with a
> lot of containers; the timeout value is hard to tune.
> 2) Use the CoordinatorStream as the heartbeat channel. Pros: the async
> pub-sub model avoids timeouts in sync methods and scales easily to a large
> number of containers. Cons: the protocol is more complex to implement;
> message/token delivery latency may be uncertain and could make the
> heartbeat process much longer.
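
For illustration, the bookkeeping that either approach would need on the
JobCoordinator side might look roughly like the sketch below. This is not
Samza's implementation: the class name HeartbeatMonitor, its methods, the
timeout parameter, and the injectable clock are all hypothetical, and the
transport (HTTP endpoint or CoordinatorStream consumer) is left out. The
only assumption is that something calls recordHeartbeat() whenever a
container checks in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch: remembers the last heartbeat time per container and
// reports containers whose last heartbeat is older than a timeout.
class HeartbeatMonitor {
    private final long timeoutMs;
    // Injectable clock (assumption) so expiry can be exercised without sleeping.
    private final LongSupplier clock;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMs, LongSupplier clock) {
        this.timeoutMs = timeoutMs;
        this.clock = clock;
    }

    // Called whenever a container checks in, regardless of the transport used.
    void recordHeartbeat(String containerId) {
        lastSeen.put(containerId, clock.getAsLong());
    }

    // A container counts as alive only if it checked in within the timeout window.
    boolean isAlive(String containerId) {
        Long ts = lastSeen.get(containerId);
        return ts != null && clock.getAsLong() - ts <= timeoutMs;
    }

    // Containers that have checked in before but have since gone silent.
    List<String> expiredContainers() {
        long now = clock.getAsLong();
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (now - e.getValue() > timeoutMs) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }
}
```

In production the clock would simply be System::currentTimeMillis. Picking
timeoutMs is the tuning problem mentioned under approach 1, and with
approach 2 it would also have to absorb the stream's delivery latency.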



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
