[
https://issues.apache.org/jira/browse/SAMZA-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chen Song updated SAMZA-871:
----------------------------
Description:
Right now, Samza relies on YARN to detect whether a container is alive or not.
This has a few problems:
1) With the effort to make Samza standalone (SAMZA-516) and more pluggable
w/ other distributed cluster management systems (like Mesos or Kubernetes),
we need to make container liveness detection independent of YARN.
2) YARN-based liveness detection has also created problems w/ leaked
containers when the NM crashes. This creates a dilemma:
## If the NM can be restarted quickly, we would like to keep the container
alive w/o being affected by the NM going down, since that preserves ongoing
work (yarn.nodemanager.recovery.enabled=true).
## However, when the RM loses the heartbeat from the NM and declares the
container "dead", we must make sure the container is actually killed to avoid
duplicate containers being launched, since the AM has no other way to know
whether the container is still alive.
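For reference, the NM work-preserving recovery mentioned above is enabled in
yarn-site.xml roughly as follows (a sketch; the recovery directory path is an
illustrative example, not a required value):

```xml
<!-- yarn-site.xml: enable NM work-preserving recovery so running
     containers survive a quick NM restart. The recovery dir path
     below is illustrative. -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/hadoop/yarn/nm-recovery</value>
</property>
```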
If we implement a direct heartbeat mechanism between the Samza JobCoordinator
and the SamzaContainers, we can stay agnostic to the YARN RM/NM/AM sync status.
Possible approaches:
1) Use the JobCoordinator HTTP port for heartbeats. Pros: simple, synchronous
communication. Cons: could become a bottleneck in a job w/ many containers,
and the timeout value is hard to tune.
2) Use the CoordinatorStream as the heartbeat channel. Pros: the async pub-sub
model avoids timeouts in sync methods and scales easily to a large number of
containers. Cons: the protocol is more complex to implement, and message/token
delivery latency may be uncertain, making the heartbeat process much longer.
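Whichever channel carries the heartbeats, the JobCoordinator-side bookkeeping
could look roughly like the sketch below (the class and method names are
hypothetical, not existing Samza APIs): a map from container ID to
last-seen timestamp, with liveness decided by a configurable timeout.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of JobCoordinator-side liveness tracking driven by
// container heartbeats, independent of YARN RM/NM state.
class ContainerHeartbeatMonitor {
    private final long timeoutMs;
    // containerId -> timestamp (ms) of the most recent heartbeat
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    ContainerHeartbeatMonitor(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when a heartbeat arrives from a container, e.g. via the
    // JobCoordinator HTTP endpoint or a CoordinatorStream message.
    void recordHeartbeat(String containerId, long nowMs) {
        lastHeartbeat.put(containerId, nowMs);
    }

    // A container is considered alive iff its last heartbeat is within
    // the timeout window; unknown containers are treated as dead.
    boolean isAlive(String containerId, long nowMs) {
        Long last = lastHeartbeat.get(containerId);
        return last != null && (nowMs - last) <= timeoutMs;
    }
}
```

The timeout trade-off described above shows up directly in timeoutMs: too
small and slow heartbeat delivery (approach 2) causes false "dead" verdicts;
too large and genuinely dead containers linger before replacement.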
> Implement heart-beat mechanism between JobCoordinator and all running
> containers
> --------------------------------------------------------------------------------
>
> Key: SAMZA-871
> URL: https://issues.apache.org/jira/browse/SAMZA-871
> Project: Samza
> Issue Type: Bug
> Reporter: Yi Pan (Data Infrastructure)
> Assignee: Chen Song
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)