[
https://issues.apache.org/jira/browse/SAMZA-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prateek Maheshwari resolved SAMZA-871.
--------------------------------------
Resolution: Fixed
> Implement heart-beat mechanism between JobCoordinator and all running
> containers
> --------------------------------------------------------------------------------
>
> Key: SAMZA-871
> URL: https://issues.apache.org/jira/browse/SAMZA-871
> Project: Samza
> Issue Type: Bug
> Reporter: Yi Pan (Data Infrastructure)
> Assignee: Abhishek Shivanna
> Fix For: 0.13.0
>
>
> Right now, Samza relies on YARN to detect whether a container is alive or
> not. This has a few problems:
> 1) with the effort to make standalone Samza (SAMZA-516) and make Samza more
> pluggable w/ other distributed cluster management system (like Mesos,
> Kubernetes), we need to make the container liveness detection independent.
> 2) YARN based liveness detection has also created problems w/ leaking
> containers when NM crashed. It creates a dilemma:
> ## In the case that NM can be restarted quickly, we would like to keep the
> container alive w/o being affected by NM goes down since that saves ongoing
> work. yarn.nodemanager.recovery.enabled=true
> ## However, when RM loses the heart beat from NM and determines that the
> container is "dead", we truly need to make sure to kill the container to
> avoid duplicate containers being launched, since AM has no other way to know
> whether the container is actually alive or not.
> If we implement a direct heart beat mechanism between Samza JobCoordinator
> and SamzaContainer, we can be agnostic to whatever the YARN RM/NM/AM sync
> status is.
> Possible approaches could be:
> 1) Use JobCoordinator HTTP port for heart beat. Pros: simple, synchronous
> communication. Cons: would potentially be a bottleneck in a job w/ a lot of
> containers, hard to tune the timeout value
> 2) Use CoordinatorStream as the heart beat channel. Pros: use async pub-sub
> model to avoid timeouts in sync methods, easy to scale to a large number of
> containers; Cons: protocol is more complex to implement, message/token
> delivery latency maybe uncertain and make the heart beat process much longer.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)