[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044250#comment-14044250
 ] 

Benjamin Mahler commented on MESOS-1529:
----------------------------------------

(2) was originally required to handle a temporary network failure. But after 
mulling it over again last night, I realized it is not sufficient. The master 
currently relies on health checking to trigger the removal of a disconnected 
slave, which has the following issue (for example):

→ Master and slave are connected and operating normally.
→ Temporary one-way network failure: the master→slave link breaks.
→ Master marks the slave as disconnected.
→ Network is restored and health checking continues normally, so the slave is 
not removed.
→ Slave remains disconnected according to the master, and the slave does not 
try to re-register. Bad!
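
To make the stuck state concrete, here is a toy model of the sequence above. It 
is only a sketch: MasterView, its method names, and the missed-ping threshold 
are all made up for illustration and are not taken from the Mesos code.

```python
from dataclasses import dataclass


@dataclass
class MasterView:
    """Hypothetical model of what the master records about one slave."""
    connected: bool = True
    removed: bool = False
    missed_pings: int = 0
    max_missed_pings: int = 5  # assumed threshold, not a Mesos constant

    def on_link_broken(self):
        # The master notices the master->slave link break.
        self.connected = False

    def on_health_check(self, pong_received):
        # Removal is driven only by consecutive failed health checks.
        if pong_received:
            self.missed_pings = 0
            return
        self.missed_pings += 1
        if self.missed_pings >= self.max_missed_pings:
            self.removed = True


def run_scenario():
    master = MasterView()
    master.on_health_check(True)    # operating normally
    master.on_link_broken()         # temporary one-way failure
    master.on_health_check(False)   # a couple of pings are lost meanwhile
    master.on_health_check(False)
    for _ in range(10):             # network restored, health checks pass again
        master.on_health_check(True)
    # Nothing in this model ever prompts the slave to re-register.
    return master
```

After run_scenario() the slave is neither connected nor removed, and no further 
event in the model changes that.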

Amended solution:

(1) On the master, add a failover timeout for slaves that disconnect. Slaves 
must re-register within the timeout.
(2) On the slave, when an 'exited' event arrives for the leading master, 
trigger re-registration.
(3) On the slave, when registered, ensure that we always receive a ping within 
t >= 75 seconds (master's health check timeout). If we don't receive a ping 
within the timeout, then trigger re-registration.

The idea is the same as above. But now, when the master marks a slave as 
disconnected, the slave must re-register within a timeout. (2) is not required 
for correctness; it is an optimization, as [~vinodkone] mentioned.
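
A rough sketch of how (1), (2) and (3) could fit together, with time passed in 
explicitly so it is easy to step through. The class and method names are 
hypothetical, the 30 second failover timeout is an arbitrary value picked for 
illustration, and refusing re-registration from an already removed slave is an 
assumption of this sketch. Only the 75 second ping timeout comes from the text 
above.

```python
PING_TIMEOUT = 75.0            # seconds; master's health check timeout (see above)
SLAVE_FAILOVER_TIMEOUT = 30.0  # seconds; assumed value, for illustration only


class Master:
    """Hypothetical master-side record for a single slave."""

    def __init__(self):
        self.connected = True
        self.removed = False
        self.disconnected_at = None

    def on_slave_disconnected(self, now):
        # (1) Start the failover timer when the slave is marked disconnected.
        self.connected = False
        self.disconnected_at = now

    def on_reregister(self, now):
        # Assumption: a removed slave is refused; otherwise cancel the timer.
        if self.removed:
            return False
        self.connected = True
        self.disconnected_at = None
        return True

    def tick(self, now):
        # (1) Remove the slave if it did not re-register within the timeout.
        if self.connected or self.removed:
            return
        if now - self.disconnected_at >= SLAVE_FAILOVER_TIMEOUT:
            self.removed = True


class Slave:
    """Hypothetical slave-side state."""

    def __init__(self, master, now):
        self.master = master
        self.registered = True
        self.last_ping = now

    def on_ping(self, now):
        self.last_ping = now

    def on_master_exited(self, now):
        # (2) Optimization: re-register as soon as 'exited' arrives.
        self.reregister(now)

    def tick(self, now):
        # (3) While registered, a missing ping triggers re-registration.
        if self.registered and now - self.last_ping >= PING_TIMEOUT:
            self.reregister(now)

    def reregister(self, now):
        self.registered = self.master.on_reregister(now)
        if self.registered:
            self.last_ping = now
```

In the one-way failure example, pings keep arriving, so (3) never fires on the 
slave; the master's tick() removes the slave once the failover timeout expires 
instead of leaving it disconnected indefinitely.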

Note that we still need to health check slaves that are disconnected, as 
before. Otherwise, if we rely only on the slave failover timeout, we may 
continually re-register→disconnect→re-register→disconnect→... a slave for which 
the one-way master→slave communication is broken.

> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
>                 Key: MESOS-1529
>                 URL: https://issues.apache.org/jira/browse/MESOS-1529
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
>
> If a network partition occurs between a Master and Slave, the Master will 
> remove the Slave (as it fails health check) and mark the tasks being run 
> there as LOST. However, the Slave is not aware that it has been removed so 
> the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives 'exited' 
> event, indicating that the connection between the master and slave is not 
> closed).
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent 
> view of a network partition. We may still see this issue should a one-way 
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the 
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks 
> as potentially lost (zombie state) but maybe the Scheduler can make a more 
> intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
