[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304189#comment-15304189
 ] 

Anand Mazumdar commented on MESOS-5468:
---------------------------------------

If a framework gets disconnected from the master for some reason, the master 
gives it {{failover_timeout}} to re-register before removing it completely. 
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L231

We currently don't specify a timeout value for the example long-lived framework, 
so it defaults to 0ns, i.e., the framework is removed as soon as it disconnects.

{noformat}
I0527 05:48:45.583395 13101 master.cpp:1396] Giving framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)) 0ns to failover
{noformat}
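
To make the timeout semantics above concrete, here is a minimal Python sketch. It is a toy model and not Mesos code: the function name {{is_framework_removed}} and all of its parameters are hypothetical, and it assumes only that the master removes a disconnected framework once {{failover_timeout}} (taken here to be in seconds) elapses without a re-registration.

```python
from typing import Optional


def is_framework_removed(failover_timeout_secs: float,
                         disconnected_at: float,
                         now: float,
                         reregistered_at: Optional[float] = None) -> bool:
    """Toy model (hypothetical helper, not a Mesos API).

    Returns True if, by time `now`, the master would have removed a
    framework that disconnected at `disconnected_at`.
    """
    # The framework has until this point in time to re-register.
    deadline = disconnected_at + failover_timeout_secs

    # Re-registering strictly before the deadline keeps the framework alive.
    if reregistered_at is not None and reregistered_at < deadline:
        return False

    # Otherwise the framework is gone once the deadline has been reached.
    return now >= deadline
```

With a timeout of 0 the deadline coincides with the disconnect itself, so this model reports the framework as removed immediately, which matches the "0ns to failover" log line above.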

I wasn't able to reproduce the socket closure issue on my end, i.e., in my setup 
the socket is closed as soon as the master disconnects the long-lived framework. 

Can you take a look at the reproduction steps on the JIRA and let me know if 
any steps are missing?

{noformat}
$  ~  netstat -tpn | grep -i 5050
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.1.1:5050          127.0.0.1:45226         ESTABLISHED 32402/lt-mesos-mast
tcp        0      0 127.0.0.1:45224         127.0.1.1:5050          ESTABLISHED 961/lt-long-lived-f
tcp        0      0 127.0.0.1:45226         127.0.1.1:5050          ESTABLISHED 961/lt-long-lived-f
tcp        0      0 127.0.1.1:5050          127.0.0.1:45224         ESTABLISHED 32402/lt-mesos-mast
{noformat}

After following the steps on the JIRA, i.e., once the long-lived framework gets 
disconnected:

{noformat}
$ ~  netstat -tpn | grep -i 5050
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:45224         127.0.1.1:5050          TIME_WAIT   -
tcp        0      0 127.0.0.1:45226         127.0.1.1:5050          TIME_WAIT   -
{noformat}


> Add logic in long-lived-framework to handle network partitions.
> ---------------------------------------------------------------
>
>                 Key: MESOS-5468
>                 URL: https://issues.apache.org/jira/browse/MESOS-5468
>             Project: Mesos
>          Issue Type: Task
>          Components: framework, master
>            Reporter: Jay Guo
>
> Currently, long-lived-framework does not handle network partitions, i.e., it 
> does not explicitly try to {{reconnect}} with the master after not receiving 
> {{HEARTBEAT}} events for a prolonged amount of time. If the master 
> disconnects a framework without the framework being aware of it (a one-way 
> partition), the framework should explicitly issue a {{reconnect}} request via 
> the scheduler library after a certain period of time.
> *On the other hand*, should we close the TCP socket on the master side when 
> tearing down a framework? Currently the TCP socket is left alive even after the 
> framework has been deactivated. This results in the framework sending invalid 
> {{Call}}s to the master and in re-detection.
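
The heartbeat-based reconnection proposed in the description could be sketched roughly as follows. This is an illustrative Python model only, not the long-lived-framework implementation: the class {{HeartbeatWatchdog}}, its parameters, and the {{reconnect}} callback are all hypothetical stand-ins for whatever the scheduler library actually exposes.

```python
from typing import Callable, Optional


class HeartbeatWatchdog:
    """Hypothetical helper: triggers a reconnect after too many missed heartbeats."""

    def __init__(self,
                 interval_secs: float,
                 max_missed: int,
                 reconnect: Callable[[], None]) -> None:
        self.interval_secs = interval_secs  # expected gap between HEARTBEAT events
        self.max_missed = max_missed        # missed heartbeats tolerated before acting
        self.reconnect = reconnect          # stand-in for the library's reconnect call
        self.last_heartbeat: Optional[float] = None

    def on_heartbeat(self, now: float) -> None:
        # Record the arrival time of the latest HEARTBEAT event.
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Call periodically; returns True if a reconnect was issued."""
        if self.last_heartbeat is None:
            # Nothing received yet, so there is no baseline to compare against.
            return False
        if now - self.last_heartbeat > self.interval_secs * self.max_missed:
            self.reconnect()
            # Restart the window so we do not reconnect on every later check.
            self.last_heartbeat = now
            return True
        return False
```

A caller would feed {{on_heartbeat}} from its event handler and invoke {{check}} from a timer. Passing the clock value in explicitly keeps the sketch deterministic and easy to test.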



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
