[jira] [Created] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

Anand Mazumdar (JIRA) Tue, 10 May 2016 15:08:10 -0700

Anand Mazumdar created MESOS-5361:
-------------------------------------

             Summary: Consider introducing TCP KeepAlive for Libprocess sockets.
                 Key: MESOS-5361
                 URL: https://issues.apache.org/jira/browse/MESOS-5361
             Project: Mesos
          Issue Type: Improvement
          Components: libprocess
            Reporter: Anand Mazumdar



We currently don't enable TCP keepalive when creating sockets in libprocess. 
Enabling it might benefit master <-> scheduler and master <-> agent 
connections, i.e. we could detect faster that one of them has failed.

Currently, if the master process goes down and for some reason the {{RST}} 
segment does not reach the scheduler, the scheduler only learns about the 
disconnection when it tries to do a {{send}} itself. 

The default TCP keep alive values on Linux are a joke though:
{code}
This means that the keepalive routines wait for two hours (7200 secs) before 
sending the first keepalive probe, and then resend it every 75 seconds. If no 
ACK response is received for nine consecutive times, the connection is marked 
as broken.
{code}
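
As a rough worked example of what those defaults imply, the time before a silently dead peer is noticed can be approximated as the idle time plus one probe interval per unanswered probe. The helper name below is hypothetical and only illustrates the arithmetic; it is not part of libprocess.

```cpp
#include <cassert>

// Hypothetical helper: approximate upper bound, in seconds, on how long a
// silently dead peer goes unnoticed with TCP keepalive enabled.
//   idle:     seconds of inactivity before the first probe is sent
//   interval: seconds between consecutive probes
//   probes:   number of unanswered probes before the connection is dropped
long worstCaseDetectionSeconds(long idle, long interval, long probes)
{
  assert(idle >= 0 && interval >= 0 && probes >= 0);
  return idle + interval * probes;
}
```

With the values quoted above, 7200 + 9 * 75 = 7875 seconds, i.e. a little over two hours and eleven minutes.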

However, for long-running instances of scheduler/agent this can still be 
beneficial. Also, operators might start tuning the values for their clusters 
explicitly once we start supporting it.
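
A minimal sketch of what enabling keepalive on a socket could look like on Linux, assuming a raw file descriptor is available. The {{KeepAliveOptions}} struct, the {{enableKeepAlive}} function and the example values are hypothetical and not existing libprocess API; only the {{setsockopt}} options themselves ({{SO_KEEPALIVE}}, {{TCP_KEEPIDLE}}, {{TCP_KEEPINTVL}}, {{TCP_KEEPCNT}}) are standard on Linux.

```cpp
#include <cassert>

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical per-socket overrides of the system-wide keepalive defaults.
// The values here are illustrative only, not a recommendation.
struct KeepAliveOptions
{
  int idleSecs = 60;     // Inactivity before the first probe (TCP_KEEPIDLE).
  int intervalSecs = 10; // Delay between probes (TCP_KEEPINTVL).
  int probes = 5;        // Unanswered probes before giving up (TCP_KEEPCNT).
};

// Enables TCP keepalive on `fd` and applies the per-socket overrides.
// Returns false if any setsockopt call fails (errno is left set).
bool enableKeepAlive(int fd, const KeepAliveOptions& options)
{
  assert(fd >= 0);

  const int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
    return false;
  }

  // Linux-specific options; other platforms use different option names.
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                 &options.idleSecs, sizeof(options.idleSecs)) != 0) {
    return false;
  }

  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                 &options.intervalSecs, sizeof(options.intervalSecs)) != 0) {
    return false;
  }

  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                 &options.probes, sizeof(options.probes)) != 0) {
    return false;
  }

  return true;
}
```

Setting only {{SO_KEEPALIVE}} would leave the system-wide defaults in effect, which is where operator tuning via sysctl would apply; the per-socket overrides shown here are one possible alternative.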



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
