[ https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jie Yu updated MESOS-5330:
--------------------------
    Fix Version/s: 0.28.3
                   0.27.4
> Agent should backoff before connecting to the master
> ----------------------------------------------------
>
> Key: MESOS-5330
> URL: https://issues.apache.org/jira/browse/MESOS-5330
> Project: Mesos
> Issue Type: Bug
> Reporter: David Robinson
> Assignee: David Robinson
> Fix For: 0.28.3, 1.0.0, 0.27.4
>
>
> When an agent starts, it launches a background task (libprocess process?) to
> detect the leading master. When the leading master is detected (or changes),
> the [SocketManager's link() method is called and a TCP connection to the
> master is established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954].
> The agent _then_ backs off before sending a ReRegisterSlave message via the
> newly established connection. The agent needs to back off _before_ attempting
> to establish a TCP connection to the master, not before sending the first
> message over the connection.
> During scale tests at Twitter, we discovered that agents can SYN flood the
> master upon leader changes. The problem described in MESOS-5200, where
> ephemeral connections are used, can then occur and exacerbate the situation.
> The end result is many hosts setting up and tearing down TCP connections
> every slave_ping_timeout seconds (15 by default), connections failing to be
> established, and hosts being marked as unhealthy and shut down. We observed
> ~800 passive TCP connections per second on the leading master during scale
> tests.
> The problem can be somewhat mitigated by tuning the kernel to handle a
> thundering herd of TCP connections, but ideally there would not be a
> thundering herd to begin with.
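
For illustration only, the ordering requested above (back off first, then
connect) could be sketched roughly as follows. This is not the Mesos
implementation or any libprocess API: randomBackoff, connectAfterBackoff,
the Millis alias and the connect callback are all hypothetical names, and a
uniformly random delay is just one assumed way of spreading agents out after
a leader change.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <random>
#include <thread>

using Millis = std::chrono::milliseconds;

// Hypothetical helper (not a Mesos API): picks a uniformly random delay in
// [0, maxBackoff] so that agents reacting to the same leader change do not
// all open their TCP connections at the same instant.
template <typename Rng>
Millis randomBackoff(Millis maxBackoff, Rng& rng)
{
  assert(maxBackoff.count() >= 0);
  std::uniform_int_distribution<Millis::rep> dist(0, maxBackoff.count());
  return Millis(dist(rng));
}

// Hypothetical helper: waits for the chosen delay *before* invoking
// `connect`, which stands in for whatever establishes the TCP connection.
// Returns the delay that was applied.
template <typename Rng>
Millis connectAfterBackoff(
    Millis maxBackoff,
    Rng& rng,
    const std::function<void()>& connect)
{
  const Millis delay = randomBackoff(maxBackoff, rng);
  std::this_thread::sleep_for(delay);  // Back off first...
  connect();                           // ...then open the connection.
  return delay;
}
```

A caller would seed a generator once (for example std::mt19937) and pass the
real connection routine as the callback. The sketch blocks the calling thread
for simplicity; an event-driven agent would presumably schedule the connect
with a timer instead.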