[ https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354210#comment-15354210 ]
Benjamin Mahler commented on MESOS-4092:
----------------------------------------
FYI [~idownes]: as part of MESOS-5576, we added the ability to force a
reconnection during link:
https://reviews.apache.org/r/49177/
> Try to re-establish connection on ping timeouts with agent before removing it
> -----------------------------------------------------------------------------
>
> Key: MESOS-4092
> URL: https://issues.apache.org/jira/browse/MESOS-4092
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Affects Versions: 0.25.0
> Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after
> {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}.
> This can occur because of transient network failures, e.g., gray failures of
> a switch uplink exhibiting heavy or total packet loss. Some network
> architectures are designed to tolerate such gray failures and support
> multiple paths between hosts. This can be implemented with equal-cost
> multi-path routing (ECMP) where flows are hashed by their 5-tuple to multiple
> possible uplinks. In such networks, re-establishing a TCP connection will
> almost certainly use a new source port and thus will likely be hashed to a
> different uplink, avoiding the failed uplink and re-establishing connectivity
> with the agent.
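
A rough illustration of the ECMP argument quoted above. This is a sketch under stated assumptions, not how any particular switch works: the hash (SHA-256, chosen only because it is deterministic and well mixed), the addresses, the ports, and the uplink count are all made up for the example. It shows that the uplink is fixed by the 5-tuple, so an existing connection stays on its uplink while a connection from a new source port may be hashed elsewhere.

```python
import hashlib


def ecmp_uplink(src_ip, src_port, dst_ip, dst_port, proto, n_uplinks):
    """Map a flow's 5-tuple to an uplink index.

    Illustrative hash only; real ECMP hashing is vendor specific.
    """
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_uplinks


def uplinks_seen(src_ports, n_uplinks=4):
    """Return the set of uplinks hit when only the source port changes."""
    return {
        ecmp_uplink("10.0.0.1", port, "10.0.1.1", 5051, "tcp", n_uplinks)
        for port in src_ports
    }


if __name__ == "__main__":
    # Same 5-tuple: always the same uplink, so retrying on the
    # existing connection cannot route around a failed uplink.
    print(ecmp_uplink("10.0.0.1", 40000, "10.0.1.1", 5051, "tcp", 4))
    # New source ports: flows are spread across the available uplinks.
    print(sorted(uplinks_seen(range(40000, 40016))))
```

With N healthy alternatives hashed uniformly, a single reconnect would avoid the failed uplink with probability of roughly (N-1)/N, which is why several attempts are worth making.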
>
> After failing to receive pongs, the SlaveObserver should next try to
> re-establish a TCP connection (with exponential back-off) before declaring
> the agent as lost. This can avoid significant disruption where large numbers
> of agents reached through a single failed link could be removed unnecessarily,
> while still ensuring that agents that are truly lost are recognized as such.
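
The behaviour proposed in the description could be sketched roughly as follows. This is not Mesos code: `observe_agent`, `ping`, `reconnect`, and the back-off parameters are hypothetical names chosen for illustration, and the default of 5 ping timeouts is an assumption about the flag's default value. The sleep function is passed in so the logic can be exercised without actually waiting.

```python
import time


def observe_agent(ping, reconnect, max_ping_timeouts=5, max_reconnects=3,
                  base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Return "reachable" or "lost" for one agent.

    ping() -> bool: True if a pong arrived within the ping timeout.
    reconnect() -> bool: True if a fresh TCP connection was established.
    All names and defaults here are hypothetical.
    """
    # Step 1: simplified current behaviour, count missed pongs.
    for _ in range(max_ping_timeouts):
        if ping():
            return "reachable"

    # Step 2 (proposed): open new connections with exponential back-off
    # before giving up, so a new 5-tuple can be routed over another path.
    delay = base_delay
    for attempt in range(max_reconnects):
        if attempt > 0:
            sleep(delay)
            delay = min(delay * 2, max_delay)
        if reconnect() and ping():
            return "reachable"

    # Step 3: only now declare the agent lost.
    return "lost"


if __name__ == "__main__":
    # An agent that never answers is declared lost after all retries.
    print(observe_agent(lambda: False, lambda: False, sleep=lambda d: None))
```

Capping the delay bounds how long a truly lost agent can go undetected, while the doubling keeps reconnect traffic low when many agents sit behind the same failed link.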
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)