[
https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037144#comment-15037144
]
Klaus Ma commented on MESOS-4048:
---------------------------------
My understanding is that {{max_slave_ping_timeouts}} + {{slave_ping_timeout}}
is used to trigger a TCP disconnected event, after which the master waits
{{slave_reregister_timeout}} for the slave to re-register. Once the master has
received a TCP disconnected event, it should not keep pinging the slave
according to {{max_slave_ping_timeouts}} + {{slave_ping_timeout}}.
{{max_slave_ping_timeouts}} + {{slave_ping_timeout}} is used to simulate TCP
keep-alive, which is not well supported on some operating systems.
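
To illustrate the keep-alive analogy, here is a minimal sketch of an
application-level liveness check that counts consecutive missed pings. It is
not Mesos code: the class and method names are hypothetical, and the only
thing taken from the flags above is the idea of a maximum number of ping
timeouts.

```python
class PingHealthCheck:
    """Hypothetical liveness check driven by application-level pings."""

    def __init__(self, max_ping_timeouts: int) -> None:
        # Plays the role of max_slave_ping_timeouts in this sketch.
        self.max_ping_timeouts = max_ping_timeouts
        self.missed = 0

    def on_pong(self) -> None:
        # Any reply resets the count of consecutive missed pings.
        self.missed = 0

    def on_timeout(self) -> bool:
        # Called when one ping got no reply within the ping timeout.
        # Returns True once the peer should be treated as unreachable.
        self.missed += 1
        return self.missed >= self.max_ping_timeouts
```

The point of the design is that a single reply resets the counter, so only an
unbroken run of missed pings marks the peer as unreachable, much like TCP
keep-alive probes.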
> Consider unifying slave timeout behavior between steady state and master
> failover
> ---------------------------------------------------------------------------------
>
> Key: MESOS-4048
> URL: https://issues.apache.org/jira/browse/MESOS-4048
> Project: Mesos
> Issue Type: Improvement
> Components: master, slave
> Reporter: Neil Conway
> Priority: Minor
> Labels: mesosphere
>
> Currently, there are two timeouts that control what happens when an agent is
> partitioned from the master:
> 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the
> master waits before declaring a slave to be dead in the "steady state"
> 2. {{slave_reregister_timeout}} controls how long the master waits for a
> slave to reregister after master failover.
> It is unclear whether these two cases really merit being treated differently
> -- it might be simpler for operators to configure a single timeout that
> controls how long the master waits before declaring that a slave is dead.
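
To make the difference between the two cases concrete, here is a rough sketch
that compares the two windows. It assumes the steady-state window is the
product of the two ping flags (one ping timeout per allowed missed ping), and
it uses values that I believe are the documented defaults (15secs, 5, and
10mins); please check both assumptions against your Mesos version. The
function name is hypothetical.

```python
from datetime import timedelta


def steady_state_window(slave_ping_timeout: timedelta,
                        max_slave_ping_timeouts: int) -> timedelta:
    # Assumption: the master gives up after this many consecutive
    # ping timeouts, so the total wait is the product of the two flags.
    return slave_ping_timeout * max_slave_ping_timeouts


# Assumed defaults; verify against the configuration docs for your version.
slave_ping_timeout = timedelta(seconds=15)
max_slave_ping_timeouts = 5
slave_reregister_timeout = timedelta(minutes=10)

steady = steady_state_window(slave_ping_timeout, max_slave_ping_timeouts)
print("steady state:", steady.total_seconds(), "seconds")
print("after master failover:", slave_reregister_timeout.total_seconds(), "seconds")
```

Under these assumptions the steady-state window is 75 seconds while the
failover window is 600 seconds, which is the kind of gap a single unified
timeout would remove.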