Ian Downes created MESOS-4092:
---------------------------------
Summary: Try to re-establish connection on ping timeouts with
agent before removing it
Key: MESOS-4092
URL: https://issues.apache.org/jira/browse/MESOS-4092
Project: Mesos
Issue Type: Improvement
Components: master
Affects Versions: 0.25.0
Reporter: Ian Downes
The SlaveObserver will trigger removal of an agent after
{{flags.max_slave_ping_timeouts}} ping timeouts, each of duration
{{flags.slave_ping_timeout}}.
This can occur because of transient network failures, e.g., gray failures of a
switch uplink exhibiting heavy or total packet loss. Some network architectures
are designed to tolerate such gray failures and support multiple paths between
hosts. This can be implemented with equal-cost multi-path routing (ECMP), where
flows are hashed by their 5-tuple to multiple possible uplinks. In such
networks, re-establishing a TCP connection will almost certainly use a new
source port and thus will likely be hashed to a different uplink, avoiding the
failed uplink and restoring connectivity with the agent.
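
To illustrate the hashing argument above, here is a minimal sketch of how a
5-tuple might be mapped to an uplink. The hash function, addresses, ports and
the names pick_uplink and uplinks_reached are assumptions made for
illustration only; real switches use vendor-specific hardware hashes.

```python
import hashlib


def pick_uplink(src_ip, src_port, dst_ip, dst_port, proto, num_uplinks):
    """Map a flow's 5-tuple to one of num_uplinks uplinks.

    SHA-256 stands in for the switch's hash here; it is an assumption,
    not what any particular switch implements.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks


def uplinks_reached(src_ports, num_uplinks=4):
    """Return the set of uplinks used when only the source port varies."""
    # Hypothetical master and agent addresses; only the source port changes.
    return {
        pick_uplink("10.0.0.1", port, "10.0.1.7", 5051, "tcp", num_uplinks)
        for port in src_ports
    }
```

Because the source port is part of the hashed key, a reconnect from a new
ephemeral port is an independent draw over the uplinks, so it has a good
chance of avoiding a single failed one.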
After failing to receive pongs, the SlaveObserver should next try to
re-establish a TCP connection (with exponential back-off) before declaring the
agent lost. This can avoid significant disruption, where large numbers of
agents reached through a single failed link could be removed unnecessarily,
while still ensuring that agents that are truly lost are recognized as such.
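
The proposed behaviour could look roughly like the sketch below. It is not
Mesos code: the function try_reconnect, its parameters and its default values
are hypothetical, and the connect and sleep arguments exist only so the logic
can be exercised without a network.

```python
import socket
import time


def try_reconnect(host, port, max_attempts=5, base_delay=1.0, max_delay=30.0,
                  connect_timeout=5.0,
                  connect=socket.create_connection, sleep=time.sleep):
    """Try to open a fresh TCP connection, backing off exponentially.

    Returns the connected socket, or None if every attempt failed, in
    which case the caller would go on to declare the agent lost.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            # Each new connection normally gets a new ephemeral source
            # port, so ECMP may hash it onto a different uplink.
            return connect((host, port), timeout=connect_timeout)
        except OSError:
            if attempt < max_attempts - 1:
                sleep(delay)
                # Double the wait before the next attempt, up to max_delay.
                delay = min(delay * 2, max_delay)
    return None
```

A caller would remove the agent only when try_reconnect returns None, so a
truly lost agent is still detected after a bounded number of attempts.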