[ 
https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868633#comment-15868633
 ] 

Ian Downes commented on MESOS-4092:
-----------------------------------

[~bmahler] IIUC, this doesn't improve behavior for gray failures where the 
connection isn't actually bad? We believe this is a major contributor to slave 
timeouts and removals in our clusters.

cc - [~ipronin] fyi

> Try to re-establish connection on ping timeouts with agent before removing it
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-4092
>                 URL: https://issues.apache.org/jira/browse/MESOS-4092
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.25.0
>            Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after 
> {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}. 
> This can occur because of transient network failures, e.g., gray failures of 
> a switch uplink exhibiting heavy or total packet loss. Some network 
> architectures are designed to tolerate such gray failures and support 
> multiple paths between hosts. This can be implemented with equal-cost 
> multi-path routing (ECMP) where flows are hashed by their 5-tuple to multiple 
> possible uplinks. In such networks re-establishing a TCP connection will 
> almost certainly use a new source port and thus will likely be hashed to a 
> different uplink, avoiding the failed uplink and re-establishing connectivity 
> with the agent.
>
> After failing to receive pongs, the SlaveObserver should next try to 
> re-establish a TCP connection (with exponential back-off) before declaring 
> the agent lost. This can avoid significant disruption, where large numbers 
> of agents reached through a single failed link would be removed unnecessarily, 
> while still ensuring that agents that are truly lost are recognized as such.
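
To make the proposal in the description concrete, here is a rough sketch of the two ideas it relies on: ECMP-style hashing of a flow's 5-tuple onto an uplink, and retrying the connection with exponential back-off before giving up. This is not Mesos code. The names pick_uplink, backoff_delays and try_reconnect are hypothetical, and SHA-256 only stands in for whatever hash a real switch uses, which is vendor-specific.

```python
import hashlib


def pick_uplink(src_ip, src_port, dst_ip, dst_port, proto, num_uplinks):
    """Hypothetical ECMP-style choice: hash the flow's 5-tuple to an uplink index."""
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}:{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks


def backoff_delays(initial, factor, max_attempts):
    """Delays (in seconds) to wait after each failed attempt, growing exponentially."""
    return [initial * factor ** i for i in range(max_attempts)]


def try_reconnect(connect, delays, sleep):
    """Call connect() once per delay until it succeeds.

    Returns True if a connection was re-established, or False if every
    attempt failed and the agent should be treated as lost.
    """
    for delay in delays:
        if connect():
            return True
        sleep(delay)  # wait before the next attempt
    return False
```

Each new TCP connection normally gets a fresh ephemeral source port, so calling pick_uplink with a different src_port will usually, though not always, select a different uplink. In a real observer, connect would open a new socket to the agent and sleep would be a timer rather than a blocking call.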



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
