[
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283593#comment-15283593
]
Anand Mazumdar commented on MESOS-5361:
---------------------------------------
- +1
- I called them a joke because the values are of little use in a real-world
application, not because of Linux's implementation of the age-old RFC.
> Consider introducing TCP KeepAlive for Libprocess sockets.
> ----------------------------------------------------------
>
> Key: MESOS-5361
> URL: https://issues.apache.org/jira/browse/MESOS-5361
> Project: Mesos
> Issue Type: Improvement
> Components: libprocess
> Reporter: Anand Mazumdar
> Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess.
> Enabling them might benefit master - scheduler and master - agent
> connections, i.e. we could detect sooner when either of them has failed.
> Currently, if the master process goes down and for some reason the {{RST}}
> segment does not reach the scheduler, the scheduler only learns about the
> disconnection when it tries to do a {{send}} itself.
> The default TCP keepalive values on Linux are a joke though:
> {code}
> This means that the keepalive routines wait for two hours (7200 secs)
> before sending the first keepalive probe, and then resend it every 75
> seconds. If no ACK response is received for nine consecutive times, the
> connection is marked as broken.
> {code}
> However, for long-running instances of the scheduler/agent this can still be
> beneficial. Also, operators might start tuning the values for their clusters
> explicitly once we start supporting it.
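
For illustration only, here is a rough sketch of what enabling and tuning keepalive on a single socket could look like. It is written in Python rather than against libprocess, and it is not a proposed patch. The helper names (enable_keepalive, detection_time) and the tuning values 60/10/5 are hypothetical. The per-socket TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are Linux-specific, so the sketch skips any that the platform does not expose.

```python
import socket

# Hypothetical tuning values, chosen only for illustration.
KEEPALIVE_IDLE_SECS = 60      # idle time before the first probe is sent
KEEPALIVE_INTERVAL_SECS = 10  # gap between successive unanswered probes
KEEPALIVE_PROBES = 5          # unanswered probes before the connection is dropped


def detection_time(idle, interval, probes):
    """Approximate seconds until a silently dead peer is noticed."""
    return idle + interval * probes


def enable_keepalive(sock, idle, interval, probes):
    """Turn on TCP keepalive and override the defaults where supported."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These options are not available everywhere, so guard each one.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    # Linux defaults from the description: 7200 + 75 * 9 seconds.
    print(detection_time(7200, 75, 9))  # 7875, roughly 2 hours 11 minutes
    print(detection_time(KEEPALIVE_IDLE_SECS,
                         KEEPALIVE_INTERVAL_SECS,
                         KEEPALIVE_PROBES))  # 110

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s, KEEPALIVE_IDLE_SECS,
                     KEEPALIVE_INTERVAL_SECS, KEEPALIVE_PROBES)
    s.close()
```

With the Linux defaults, a peer that vanishes without an RST goes unnoticed for about 7875 seconds, which is why per-socket (or operator-tuned) values matter if keepalive is to be useful for failure detection.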