[jira] [Updated] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.

Anand Mazumdar (JIRA) Thu, 26 May 2016 22:34:18 -0700

     [ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anand Mazumdar updated MESOS-5468:
----------------------------------
    Description: 
Currently, long-lived-framework does not handle network partitions, i.e., it 
does not explicitly try to {{reconnect}} with the master after receiving no 
{{HEARTBEAT}} events for a prolonged period. If the master disconnects a 
framework without the framework being aware of it (a one-way partition), the 
framework should explicitly issue a {{reconnect}} request via the scheduler 
library after a certain period of time.
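
The timeout logic described above could be sketched roughly as follows. This 
is a minimal, language-neutral illustration in Python, not the actual 
long-lived-framework code or the Mesos scheduler library API: the 
HeartbeatWatchdog class, the missed_intervals tolerance, and the reconnect 
callback are all hypothetical names introduced here. The callback stands in 
for whatever reconnect request the scheduler library exposes.

```python
import time


class HeartbeatWatchdog:
    """Tracks when the last HEARTBEAT arrived and flags when a reconnect is due.

    Hypothetical helper for illustration only; not part of Mesos.
    """

    def __init__(self, heartbeat_interval_secs, missed_intervals=5,
                 clock=time.monotonic):
        # Assumption: tolerate a fixed number of silent heartbeat intervals
        # before treating the connection as partitioned.
        self.timeout_secs = heartbeat_interval_secs * missed_intervals
        self.clock = clock
        self.last_seen = clock()

    def on_heartbeat(self):
        # Call this whenever a HEARTBEAT event is received.
        self.last_seen = self.clock()

    def should_reconnect(self):
        # True once the silence has lasted longer than the tolerated timeout.
        return self.clock() - self.last_seen > self.timeout_secs

    def poll(self, reconnect):
        # Invoke the caller-supplied reconnect callback if the timeout expired.
        if self.should_reconnect():
            reconnect()
            # Restart the timer so repeated polls do not reconnect in a loop.
            self.last_seen = self.clock()
            return True
        return False
```

A framework would call on_heartbeat() from its event handler and poll() from a 
periodic timer. Injecting the clock keeps the timeout testable without 
sleeping.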

*On the other hand*, should we close the TCP socket on the master side when 
tearing down a framework? Currently, the TCP socket is left open even after the 
framework has been deactivated. This results in the framework sending invalid 
{{Call}}s to the master and in re-detection.

  was:
Currently, long-lived-framework does not handle HEARTBEAT timeouts. If the 
master tears down the framework without the framework being aware of it 
(network partition), the framework keeps waiting for an {{Event}} until it is 
reconnected.

*On the other hand*, should we close the TCP socket on the master side when 
tearing down a framework? Currently, the TCP socket is left open even after the 
framework has been deactivated. This results in the framework sending invalid 
{{Call}}s to the master and in re-detection.


> Add logic in long-lived-framework to handle network partitions.
> ---------------------------------------------------------------
>
>                 Key: MESOS-5468
>                 URL: https://issues.apache.org/jira/browse/MESOS-5468
>             Project: Mesos
>          Issue Type: Task
>          Components: framework, master
>            Reporter: Jay Guo
>
> Currently, long-lived-framework does not handle network partitions, i.e., it 
> does not explicitly try to {{reconnect}} with the master after receiving no 
> {{HEARTBEAT}} events for a prolonged period. If the master disconnects a 
> framework without the framework being aware of it (a one-way partition), the 
> framework should explicitly issue a {{reconnect}} request via the scheduler 
> library after a certain period of time.
> *On the other hand*, should we close the TCP socket on the master side when 
> tearing down a framework? Currently, the TCP socket is left open even after 
> the framework has been deactivated. This results in the framework sending 
> invalid {{Call}}s to the master and in re-detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
