[ https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ilya Pronin reassigned MESOS-5361:
----------------------------------

    Assignee: Ilya Pronin

> Consider introducing TCP KeepAlive for Libprocess sockets.
> ----------------------------------------------------------
>
>                 Key: MESOS-5361
>                 URL: https://issues.apache.org/jira/browse/MESOS-5361
>             Project: Mesos
>          Issue Type: Improvement
>          Components: libprocess
>            Reporter: Anand Mazumdar
>            Assignee: Ilya Pronin
>              Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess.
> Enabling them might benefit master - scheduler and master - agent
> connections, i.e. we could detect faster when either side has failed.
> Currently, if the master process goes down and for some reason the {{RST}}
> segment does not reach the scheduler, the scheduler only learns about the
> disconnection when it tries to do a {{send}} itself.
> The default TCP keepalive values on Linux are of little use in a real-world
> application:
> {code}
> This means that the keepalive routines wait for two hours (7200 secs)
> before sending the first keepalive probe, and then resend it every 75
> seconds. If no ACK response is received for nine consecutive times, the
> connection is marked as broken.
> {code}
> However, for long-running instances of the scheduler/agent, keepalives can
> still be beneficial. Also, operators might start tuning the values for their
> clusters explicitly once we start supporting it.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
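
For illustration only, here is a minimal sketch of what per-socket keepalive tuning looks like at the socket-option level. It is written in Python rather than libprocess's C++, and it is not the change proposed for libprocess. The helper names `enable_keepalive` and `detection_time_secs` and the default values (60 s idle, 10 s interval, 5 probes) are assumptions chosen for the example. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are Linux-specific, so the sketch applies them only when the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_secs=60, interval_secs=10, probe_count=5):
    """Turn on TCP keepalive for `sock` and override the system defaults where possible.

    Hypothetical helper; the default values are illustrative, not recommendations.
    """
    # SO_KEEPALIVE is portable; it switches keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket tuning options are platform-specific (available on Linux),
    # so each one is applied only if this Python build exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle_secs),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_secs),  # delay between unanswered probes
        ("TCP_KEEPCNT", probe_count),      # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def detection_time_secs(idle_secs, interval_secs, probe_count):
    """Approximate time before a silently dead peer is detected."""
    return idle_secs + interval_secs * probe_count


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    s.close()
    # Linux defaults quoted in the issue: 7200 s idle, 75 s interval, 9 probes.
    print(detection_time_secs(7200, 75, 9))
```

With the Linux defaults quoted in the issue, the estimate comes to 7200 + 9 * 75 = 7875 seconds, a little over two hours, which is why explicit tuning matters for failure detection.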