[ 
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1668:
-----------------------------------


Placing this under reconciliation because, although extremely rare, it can leave 
the master and slave in an inconsistent state for an arbitrary amount of time. 
For example, a launchTask message can be dropped as a result of the 
master → slave socket closure in the scenario described in this issue.

> Handle a temporary one-way master --> slave socket closure.
> -----------------------------------------------------------
>
>                 Key: MESOS-1668
>                 URL: https://issues.apache.org/jira/browse/MESOS-1668
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>            Reporter: Benjamin Mahler
>            Priority: Minor
>              Labels: reliability
>
> In MESOS-1529, we realized that it's possible for a slave to remain 
> disconnected in the master if the following occurs:
> → Master and slave are connected and operating normally.
> → Temporary one-way network failure, master→slave link breaks.
> → Master marks slave as disconnected.
> → Network is restored and health checking continues normally, so the slave is 
> not removed. The slave does not attempt to re-register because it is 
> receiving pings once again.
> → Slave remains disconnected according to the master, and the slave does not 
> try to re-register. Bad!
> We were originally thinking of using a failover timeout in the master to 
> remove these slaves that don't re-register. However, that approach can be 
> dangerous when ZooKeeper issues are preventing the slave from re-registering 
> with the master; we do not want to remove a ton of slaves in this situation.
> Rather, when the slave is health checking correctly but does not re-register 
> within a timeout, we could send a registration request from the master to the 
> slave, telling the slave that it must re-register. This message could also be 
> used when receiving status updates (or other messages) from slaves that are 
> disconnected in the master.
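
The proposal in the description can be sketched as a small simulation. This is 
only an illustration under stated assumptions, not Mesos code: the Master and 
SlaveState classes, the REREGISTER_REQUEST message name, and the 30 second 
timeout are all hypothetical, and messages are collected in a list rather than 
sent over a socket. The point it shows is that a disconnected slave is never 
removed on a timer; it is only asked to re-register, and only when it is 
answering health checks or sending other messages.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical message name and timeout; the issue specifies neither.
REREGISTER_REQUEST = "REREGISTER_REQUEST"
REREGISTER_TIMEOUT_SECS = 30.0


@dataclass
class SlaveState:
    connected: bool = True
    disconnected_at: Optional[float] = None  # when the master saw the socket close
    last_pong_at: Optional[float] = None     # last successful health check reply


class Master:
    def __init__(self, reregister_timeout: float = REREGISTER_TIMEOUT_SECS) -> None:
        self.reregister_timeout = reregister_timeout
        self.slaves: Dict[str, SlaveState] = {}
        self.outbox: List[Tuple[str, str]] = []  # (slave_id, message) pairs "sent"

    def add_slave(self, slave_id: str) -> None:
        self.slaves[slave_id] = SlaveState()

    def on_socket_closed(self, slave_id: str, now: float) -> None:
        # One-way link failure: the master marks the slave as disconnected.
        state = self.slaves[slave_id]
        state.connected = False
        state.disconnected_at = now

    def on_pong(self, slave_id: str, now: float) -> None:
        # Health checking keeps working once the network is restored.
        self.slaves[slave_id].last_pong_at = now

    def on_reregistered(self, slave_id: str) -> None:
        state = self.slaves[slave_id]
        state.connected = True
        state.disconnected_at = None

    def on_status_update(self, slave_id: str) -> None:
        # Any message from a slave the master considers disconnected
        # triggers a re-registration request.
        if not self.slaves[slave_id].connected:
            self.outbox.append((slave_id, REREGISTER_REQUEST))

    def tick(self, now: float) -> None:
        # Periodic check: ask healthy-but-disconnected slaves to re-register.
        for slave_id, state in self.slaves.items():
            if state.connected or state.disconnected_at is None:
                continue
            healthy = (state.last_pong_at is not None
                       and state.last_pong_at > state.disconnected_at)
            overdue = now - state.disconnected_at >= self.reregister_timeout
            if healthy and overdue:
                self.outbox.append((slave_id, REREGISTER_REQUEST))
                state.disconnected_at = now  # restart the timer before asking again
```

In this sketch a slave that is not answering health checks (for instance 
during a ZooKeeper problem) never receives a request from tick() and is never 
removed, which matches the concern raised about the failover-timeout approach.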



--
This message was sent by Atlassian JIRA
(v6.2#6252)
