[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304360#comment-15304360
 ] 

Jay Guo commented on MESOS-5468:
--------------------------------

What is your iptables command? I can consistently reproduce the problem on the 
latest build.

* How long does it take for the master to disconnect the framework after the 
network partition (i.e. after the iptables command is issued)?

* Do the TCP sockets go into the FIN_WAIT_1 state?

I think the real question is how the master notices a network partition. IIUC, 
it relies on the TCP socket timeout, which is typically 13-30 minutes on a 
Linux box (see the tcp man page), and that matches the delay I observed between 
the disconnect and the give-up. At that point, the TCP socket reports the 
broken link to the user (mesos-master) while remaining ESTABLISHED. It is then 
up to the application to handle the failure, and I suspect that libprocess does 
not properly close the socket here. I'll need to investigate further.
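
For reference, a rough sketch of where the 13-30 minute figure comes from. It 
assumes the Linux defaults as I understand them: net.ipv4.tcp_retries2 = 15, a 
minimum retransmission timeout (RTO) of 200 ms and a maximum of 120 s. The 
function name is made up for illustration, and the real RTO is derived from the 
measured round-trip time, so treat the result as a lower bound rather than an 
exact value.

```python
def tcp_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Lower bound on how long Linux keeps retransmitting before giving up.

    Assumes the retransmission timeout starts at rto_min, doubles after
    every unanswered retransmission, and is capped at rto_max.
    """
    # One wait before each of the `retries` retransmissions, plus a final
    # wait after the last one before the connection is declared dead.
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


if __name__ == "__main__":
    seconds = tcp_give_up_seconds()
    print(f"{seconds:.1f} s (~{seconds / 60:.1f} min)")  # 924.6 s (~15.4 min)
```

If the RTO has already backed off to the 120 s cap when the partition starts, 
the same sum gives 16 * 120 s = 32 min, which is roughly the upper end of the 
range.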

I see other users hitting the {{Transport endpoint is not connected}} error, 
and I have seen it many times myself, so I think we should definitely take a 
serious look at it.

Another question: why didn't we use a mature HTTP library from the very 
beginning instead of writing our own implementation?

Cheers,
/J

> Add logic in long-lived-framework to handle network partitions.
> ---------------------------------------------------------------
>
>                 Key: MESOS-5468
>                 URL: https://issues.apache.org/jira/browse/MESOS-5468
>             Project: Mesos
>          Issue Type: Task
>          Components: framework, master
>            Reporter: Jay Guo
>
> Currently long-lived-framework does not handle network partitions, i.e., it 
> does not explicitly try to {{reconnect}} with the master after not receiving 
> {{HEARTBEAT}} events for a prolonged period of time. If the master 
> disconnects a framework without the framework being aware of it (one-way 
> partition), the framework should explicitly issue a {{reconnect}} request via 
> the scheduler library after a certain period of time.
> *On the other hand*, should we close the TCP socket on the master side when 
> tearing down a framework? Currently the TCP socket is left alive even after 
> the framework has been deactivated. This results in the framework sending 
> invalid {{Call}}s to the master and in re-detection.
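
The reconnect-on-missed-heartbeats idea in the description could look roughly 
like the sketch below. Everything in it is hypothetical: the class, its method 
names, and the "five missed heartbeats" threshold are assumptions for 
illustration, not part of the scheduler library.

```python
import time


class HeartbeatWatchdog:
    """Tracks HEARTBEAT events and reports when the master has gone quiet.

    All names here are hypothetical; this is not the scheduler library API.
    """

    def __init__(self, heartbeat_interval, missed_allowed=5,
                 clock=time.monotonic):
        # Give up after this many seconds without a HEARTBEAT.
        self._timeout = heartbeat_interval * missed_allowed
        self._clock = clock
        self._last_seen = clock()

    def on_heartbeat(self):
        # Call this whenever a HEARTBEAT event arrives on the event stream.
        self._last_seen = self._clock()

    def should_reconnect(self):
        # True once no HEARTBEAT has arrived for longer than the timeout.
        return self._clock() - self._last_seen > self._timeout
```

The framework would call on_heartbeat() from its event handler and poll 
should_reconnect() from a timer, issuing {{reconnect}} when it returns True. 
The clock is injectable so the timeout logic can be tested without sleeping.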



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
