[ 
https://issues.apache.org/jira/browse/SAMZA-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981605#comment-15981605
 ] 

Abhishek Shivanna commented on SAMZA-871:
-----------------------------------------

[~capricornius] I have a prototype already working for this issue. I will 
submit the SEP and the patch in the next couple of days. I would love to get 
inputs from you if you are also interested in working on this.

> Implement heart-beat mechanism between JobCoordinator and all running 
> containers
> --------------------------------------------------------------------------------
>
>                 Key: SAMZA-871
>                 URL: https://issues.apache.org/jira/browse/SAMZA-871
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Yi Pan (Data Infrastructure)
>            Assignee: Chen Song
>
> Right now, Samza relies on YARN to detect whether a container is alive or 
> not. This has a few problems:
> 1) with the effort to make standalone Samza (SAMZA-516) and make Samza more 
> pluggable w/ other distributed cluster management system (like Mesos, 
> Kubernetes), we need to make the container liveness detection independent.
> 2) YARN based liveness detection has also created problems w/ leaking 
> containers when NM crashed. It creates a dilemma:
>  ## In the case that NM can be restarted quickly, we would like to keep the 
> container alive w/o being affected by NM goes down since that saves ongoing 
> work. yarn.nodemanager.recovery.enabled=true
> ## However, when RM loses the heart beat from NM and determines that the 
> container is "dead", we truly need to make sure to kill the container to 
> avoid duplicate containers being launched, since AM has no other way to know 
> whether the container is actually alive or not.
> If we implement a direct heart beat mechanism between Samza JobCoordinator 
> and SamzaContainer, we can be agnostic to whatever the YARN RM/NM/AM sync 
> status is.
> Possible approaches could be:
> 1) Use JobCoordinator HTTP port for heart beat. Pros: simple, synchronous 
> communication. Cons: would potentially be a bottleneck in a job w/ a lot of 
> containers, hard to tune the timeout value
> 2) Use CoordinatorStream as the heart beat channel. Pros: use async pub-sub 
> model to avoid timeouts in sync methods, easy to scale to a large number of 
> containers; Cons: protocol is more complex to implement, message/token 
> delivery latency maybe uncertain and make the heart beat process much longer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to