[ 
https://issues.apache.org/jira/browse/MESOS-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239864#comment-15239864
 ] 

David Robinson commented on MESOS-5200:
---------------------------------------

My bad, yeah it is called.

> agent->master messages use temporary TCP connections
> ----------------------------------------------------
>
>                 Key: MESOS-5200
>                 URL: https://issues.apache.org/jira/browse/MESOS-5200
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: David Robinson
>
> Background info: When an agent is started it starts a background task 
> (libprocess process?) to detect the leading master. When the leading master 
> is detected (or changes) the [SocketManager's link() 
> method|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1415]
>  [is 
> called|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L942] 
> and a TCP connection to the master is established. The connection is used by 
> the agent to send messages to the master, and the master, upon receiving a 
> RegisterSlaveMessage/ReregisterSlaveMessage, establishes another TCP 
> connection back to the agent. Each TCP connection is uni-directional: the 
> agent writes messages on one connection and reads messages from the other, 
> and the master reads/writes from the opposite ends of the connections.
> If the initial TCP connection to the master fails to be established, then 
> temporary connections are used for all agent->master messages; each send() 
> causes a new TCP connection to be set up, the message sent, then the 
> connection torn down. If link() succeeds, a persistent TCP connection is used 
> instead.
> If agents do not use ZK to detect the master, the master detector 
> "detects" the master immediately and the agent attempts to connect right 
> away. The master may not be listening for connections at that time, or it 
> could be overwhelmed with TCP connection attempts, in which case the initial 
> TCP connection attempt fails. The agent does not attempt to establish a new 
> persistent connection because link() is only called when a new master is 
> detected, which only occurs once unless ZK is used.
> It's possible for agents to overwhelm a master with TCP connections such 
> that agents cannot establish connections. When this occurs, pong messages 
> may not be received by the master, so the master shuts down the agents, 
> killing any tasks they were running. We have witnessed this scenario during 
> scale/load tests at Twitter.
> The problem is trivial to reproduce: configure an agent to use a certain 
> master (--master=10.20.30.40:5050), start the agent, wait several minutes, 
> then start the master. All the agent->master messages will occur over 
> temporary connections.
> The problem occurs less frequently in production because ZK is typically used 
> for master detection and a master only registers in ZK after it has started 
> listening on its socket. However, the scenario described above can also occur 
> when ZK is used: a thundering herd of 10,000+ agents establishing TCP 
> connections to the master can result in some connection attempts failing and 
> those agents using temporary connections.
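
To make the behaviour described in the issue concrete, here is a toy Python model of it. This is only a sketch of the reported behaviour, not libprocess code: ToyMaster, ToyAgent and simulate() are made-up names, and "accepting" a connection is reduced to incrementing a counter. It assumes, as the report says, that link() is attempted once and never retried, and that send() falls back to a one-off connection when no persistent connection exists.

```python
class ToyMaster:
    """Hypothetical stand-in for a master that may not be listening yet."""

    def __init__(self):
        self.listening = False
        self.accepted = 0  # total TCP connections accepted so far

    def accept(self):
        if not self.listening:
            raise ConnectionRefusedError("master is not listening")
        self.accepted += 1


class ToyAgent:
    """Hypothetical stand-in for the agent side of link()/send()."""

    def __init__(self, master):
        self.master = master
        self.persistent = False  # True only if link() succeeded

    def link(self):
        # Attempted once per master detection; a failure is not retried.
        try:
            self.master.accept()
            self.persistent = True
        except ConnectionRefusedError:
            self.persistent = False

    def send(self, message):
        # Reuse the persistent connection when there is one.
        if self.persistent:
            return "persistent"
        # Otherwise: set up a connection, send, tear it down again.
        self.master.accept()
        return "temporary"


def simulate(master_up_at_link, messages):
    """Return (kind of connection used per message, connections accepted)."""
    master = ToyMaster()
    master.listening = master_up_at_link
    agent = ToyAgent(master)
    agent.link()
    master.listening = True  # the master is up by the time messages flow
    kinds = [agent.send("ping") for _ in range(messages)]
    return kinds, master.accepted
```

In this model, simulate(False, 3) uses a temporary connection for each of the three messages and the master accepts three connections, whereas simulate(True, 3) sends all three messages over the single connection that link() established. With many agents, the first case multiplies the number of connection attempts the master has to handle.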



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
