[
https://issues.apache.org/jira/browse/MESOS-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239946#comment-15239946
]
Benjamin Mahler commented on MESOS-5200:
----------------------------------------
[~drobinson] Ok, in that case I'll go ahead and close this out as a duplicate
of MESOS-1963.
> agent->master messages use temporary TCP connections
> ----------------------------------------------------
>
> Key: MESOS-5200
> URL: https://issues.apache.org/jira/browse/MESOS-5200
> Project: Mesos
> Issue Type: Bug
> Reporter: David Robinson
>
> Background info: When an agent is started it starts a background task
> (libprocess process?) to detect the leading master. When the leading master
> is detected (or changes) the [SocketManager's link()
> method|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1415]
> [is
> called|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L942]
> and a TCP connection to the master is established. The connection is used by
> the agent to send messages to the master, and the master, upon receiving a
> RegisterSlaveMessage/ReregisterSlaveMessage, establishes another TCP
> connection back to the agent. Each TCP connection is uni-directional: the
> agent writes messages on one connection and reads messages from the other,
> and the master reads/writes from the opposite ends of the connections.
>
> If the initial TCP connection to the master fails to be established then
> temporary connections are used for all agent->master messages; each send()
> causes a new TCP connection to be set up, the message sent, then the
> connection torn down. If link() succeeds, a persistent TCP connection is used
> instead.
>
> If agents do not use ZK to detect the master then the master detector
> "detects" the master immediately and attempts to connect immediately. The
> master may not be listening for connections at the time, or it could be
> overwhelmed with TCP connection attempts, in which case the initial TCP
> connection attempt fails. The agent does not attempt to establish a new
> persistent connection, as link() is only called when a new master is
> detected, which only occurs once unless ZK is used.
>
> It's possible for agents to overwhelm a master with TCP connections such that
> agents cannot establish connections. When this occurs, pong messages may not
> be received by the master, so the master shuts down agents, thus killing any
> tasks they were running. We have witnessed this scenario during scale/load
> tests at Twitter.
>
> The problem is trivial to reproduce: configure an agent to use a certain
> master (--master=10.20.30.40:5050), start the agent, wait several minutes,
> then start the master. All the agent->master messages will occur over
> temporary connections.
>
> The problem occurs less frequently in production because ZK is typically used
> for master detection and a master only registers in ZK after it has started
> listening on its socket. However, the scenario described above can also occur
> when ZK is used – a thundering herd of 10,000+ agents establishing TCP
> connections to the master can result in some connection attempts failing and
> agents using temporary connections.
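
To make the link()/send() behaviour described in the quoted report concrete, here is a minimal toy model in C++. It is not libprocess code: ToyMaster, ToyAgent and all of their members are hypothetical names invented for illustration, and no real sockets are opened. It only counts how many connections the master would have to accept, assuming link() is attempted exactly once when the master is "detected", as the report describes.

```cpp
#include <cassert>
#include <cstddef>

// Toy model only. None of these types exist in Mesos or libprocess.

// Stands in for the master's listening socket.
struct ToyMaster {
  bool listening = false;
  std::size_t connectionsAccepted = 0;

  // Returns true if a connection attempt would succeed.
  bool accept() {
    if (!listening) {
      return false;
    }
    ++connectionsAccepted;
    return true;
  }
};

// Stands in for the agent side of the agent->master channel.
class ToyAgent {
public:
  explicit ToyAgent(ToyMaster* master) : master_(master) {
    assert(master_ != nullptr);
  }

  // Assumed to be called once, when the master is detected.
  // There is no retry if the attempt fails.
  void link() { persistent_ = master_->accept(); }

  // Returns true if the message would have been delivered.
  bool send() {
    if (persistent_) {
      return true;  // Reuse the persistent connection.
    }
    // No persistent link: open a temporary connection for this message.
    if (!master_->accept()) {
      return false;
    }
    ++temporaryConnections_;  // Conceptually torn down right after the send.
    return true;
  }

  bool hasPersistentLink() const { return persistent_; }
  std::size_t temporaryConnections() const { return temporaryConnections_; }

private:
  ToyMaster* master_;
  bool persistent_ = false;
  std::size_t temporaryConnections_ = 0;
};
```

Under these assumptions, an agent that calls link() before the master is listening never obtains a persistent link, so every later send() costs the master one accepted connection. An agent whose link() succeeds costs the master one connection in total, however many messages it sends.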