[
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283605#comment-15283605
]
haosdent commented on MESOS-5361:
---------------------------------
I see. XD
> Consider introducing TCP KeepAlive for Libprocess sockets.
> ----------------------------------------------------------
>
> Key: MESOS-5361
> URL: https://issues.apache.org/jira/browse/MESOS-5361
> Project: Mesos
> Issue Type: Improvement
> Components: libprocess
> Reporter: Anand Mazumdar
> Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess.
> Enabling them might benefit master - scheduler and master - agent
> connections, i.e. we could detect sooner when either side has failed.
> Currently, if the master process goes down and for some reason the {{RST}}
> segment does not reach the scheduler, the scheduler only learns about the
> disconnection when it tries to do a {{send}} itself.
> The default TCP keepalive values on Linux are of little use in a real-world
> application:
> {code}
> . This means that the keepalive routines wait for two hours (7200 secs)
> before sending the first keepalive probe, and then resend it every 75
> seconds. If no ACK response is received for nine consecutive times, the
> connection is marked as broken.
> {code}
> However, for long-running scheduler/agent instances even these defaults can
> still be beneficial. Also, operators might start tuning the values for their
> clusters explicitly once we start supporting it.
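
As a rough illustration of the proposal, here is a minimal sketch of enabling keepalive with explicit values on an already-created TCP socket. It is not libprocess code: the names KeepAliveSettings, linuxDefaults, timeToDetect and enableKeepAlive are hypothetical, and it assumes the Linux-specific TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT socket options. The detection time is estimated as idle time plus interval times probe count.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#include <chrono>

// Hypothetical settings type; the names are illustrative and are not
// part of libprocess.
struct KeepAliveSettings {
  std::chrono::seconds idle;      // Idle time before the first probe.
  std::chrono::seconds interval;  // Time between unanswered probes.
  int probes;                     // Unanswered probes before giving up.
};

// The Linux defaults quoted in the description: 7200s, 75s, 9 probes.
inline KeepAliveSettings linuxDefaults() {
  return {std::chrono::seconds(7200), std::chrono::seconds(75), 9};
}

// Approximate time until an idle connection to a dead peer is marked
// broken: the idle period plus one interval per unanswered probe.
inline std::chrono::seconds timeToDetect(const KeepAliveSettings& s) {
  return s.idle + s.interval * s.probes;
}

// Enables keepalive on the TCP socket `fd` using Linux-specific options.
// Returns true if every setsockopt call succeeded; otherwise errno holds
// the error from the first call that failed.
inline bool enableKeepAlive(int fd, const KeepAliveSettings& s) {
  const int on = 1;
  const int idle = static_cast<int>(s.idle.count());
  const int interval = static_cast<int>(s.interval.count());
  const int probes = s.probes;

  return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval,
                    sizeof(interval)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes,
                    sizeof(probes)) == 0;
}
```

With the defaults, timeToDetect gives 7200 + 9 * 75 = 7875 seconds, a bit over two hours and eleven minutes. With purely illustrative values of 60 seconds idle, a 10 second interval and 3 probes, it gives 90 seconds. Those numbers are examples, not recommendations for a Mesos cluster.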
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)