[
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dominic Hamon updated MESOS-1668:
---------------------------------
Sprint: Mesos Q3 Sprint 5, Mesos Q3 Sprint 6 (was: Mesos Q3 Sprint 5)
> Handle a temporary one-way master --> slave socket closure.
> -----------------------------------------------------------
>
> Key: MESOS-1668
> URL: https://issues.apache.org/jira/browse/MESOS-1668
> Project: Mesos
> Issue Type: Bug
> Components: master, slave
> Reporter: Benjamin Mahler
> Assignee: Vinod Kone
> Priority: Minor
> Labels: reliability
>
> In MESOS-1529, we realized that a slave can remain marked as disconnected
> in the master if the following occurs:
> → Master and slave are connected and operating normally.
> → A temporary one-way network failure breaks the master→slave link.
> → Master marks the slave as disconnected.
> → The network is restored and health checking continues normally, so the
> slave is not removed. The slave does not attempt to re-register since it is
> receiving pings once again.
> → The slave remains disconnected according to the master and never tries
> to re-register. Bad!
> We were originally thinking of using a failover timeout in the master to
> remove these slaves that don't re-register. However, that can be dangerous
> when ZooKeeper issues are preventing the slaves from re-registering with the
> master; we do not want to remove a ton of slaves in this situation.
> Rather, when the slave is health checking correctly but does not re-register
> within a timeout, we could send a registration request from the master to the
> slave, telling the slave that it must re-register. This message could also be
> used when receiving status updates (or other messages) from slaves that are
> disconnected in the master.
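
The proposal above can be sketched roughly as follows. This is a minimal
illustration only, not Mesos code: the names `Master`, `SlaveState`,
`REREGISTER_TIMEOUT_SECS` and the `"REREGISTER"` message are all hypothetical,
and the timeout value is an assumption since the issue does not specify one.
"Health checking correctly" is approximated here as "a pong arrived after the
master marked the slave disconnected".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical value; the issue does not say how long the master should wait.
REREGISTER_TIMEOUT_SECS = 60.0


@dataclass
class SlaveState:
    connected: bool = True
    disconnected_at: Optional[float] = None  # when the master marked it disconnected
    last_pong_at: Optional[float] = None     # last successful health check reply
    reregister_requested: bool = False       # avoid sending the request repeatedly


class Master:
    def __init__(self) -> None:
        self.slaves: Dict[str, SlaveState] = {}
        # (slave_id, message) pairs standing in for messages sent to slaves.
        self.outbox: List[Tuple[str, str]] = []

    def add_slave(self, slave_id: str) -> None:
        self.slaves[slave_id] = SlaveState()

    def on_socket_closed(self, slave_id: str, now: float) -> None:
        # The master -> slave link broke: mark the slave as disconnected.
        slave = self.slaves[slave_id]
        slave.connected = False
        slave.disconnected_at = now
        slave.reregister_requested = False

    def on_pong(self, slave_id: str, now: float) -> None:
        self.slaves[slave_id].last_pong_at = now

    def on_reregistered(self, slave_id: str) -> None:
        slave = self.slaves[slave_id]
        slave.connected = True
        slave.disconnected_at = None
        slave.reregister_requested = False

    def on_message_from_slave(self, slave_id: str) -> None:
        # A status update (or other message) from a slave that the master
        # still considers disconnected also triggers the request.
        if not self.slaves[slave_id].connected:
            self._request_reregister(slave_id)

    def check(self, now: float) -> None:
        # Periodic check: the slave is healthy again but has not
        # re-registered within the timeout, so tell it to do so.
        for slave_id, slave in self.slaves.items():
            if slave.connected or slave.disconnected_at is None:
                continue
            healthy = (slave.last_pong_at is not None
                       and slave.last_pong_at > slave.disconnected_at)
            timed_out = now - slave.disconnected_at >= REREGISTER_TIMEOUT_SECS
            if healthy and timed_out:
                self._request_reregister(slave_id)

    def _request_reregister(self, slave_id: str) -> None:
        slave = self.slaves[slave_id]
        if not slave.reregister_requested:
            slave.reregister_requested = True
            self.outbox.append((slave_id, "REREGISTER"))
```

Note that in this sketch a slave that is not answering health checks is never
asked to re-register and is never removed; removal is left to the existing
health check path, which matches the concern about not removing many slaves
during ZooKeeper trouble.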