[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042918#comment-14042918
 ] 

Benjamin Mahler commented on MESOS-1529:
----------------------------------------

[~vinodkone] and I chatted extensively about the various solutions proposed and
have settled on the following. Please let us know if you see any potential
issues:

(1) Leave the master's ping/pong health checking unchanged.
(2) On the slave, when an 'exited' event arrives for the leading master, 
trigger re-registration.
(3) On the slave, while registered, ensure that we always receive a ping within
a timeout t >= 75 seconds (the master's health check timeout). If we don't
receive a ping within the timeout, trigger re-registration.

The idea here is to ensure that if there is no communication arriving from the 
master (or the socket closes), the slave should no longer consider itself 
registered. At this point, the slave must try to re-register with the master; 
this allows the master to determine if the slave can be re-registered or must 
be shut down.
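
To make (2) and (3) concrete, here is a rough sketch of the slave-side logic.
It is not Mesos code: the class and method names are hypothetical, and it
takes explicit timestamps instead of using real timers so the behaviour is
easy to follow. The only value taken from the proposal is the 75 second
timeout.

```python
PING_TIMEOUT_SECS = 75.0  # the master's health check timeout from (3)


class SlaveSketch:
    """Hypothetical model of slave-side rules (2) and (3); not Mesos code."""

    def __init__(self, now=0.0):
        self.registered = False
        self.last_ping = now
        self.reregistrations = 0

    def on_registered(self, now):
        # The master acknowledged (re-)registration; start the ping clock.
        self.registered = True
        self.last_ping = now

    def on_ping(self, now):
        # Any ping from the leading master resets the timeout.
        self.last_ping = now

    def on_master_exited(self):
        # Rule (2): the socket to the leading master closed.
        self._trigger_reregistration()

    def check_ping_timeout(self, now):
        # Rule (3): only enforced while the slave believes it is registered.
        if self.registered and now - self.last_ping >= PING_TIMEOUT_SECS:
            self._trigger_reregistration()

    def _trigger_reregistration(self):
        # Stop considering ourselves registered; the master then decides
        # whether the slave can be re-registered or must be shut down.
        self.registered = False
        self.reregistrations += 1


# Example: a partition begins right after the ping at t=15.
slave = SlaveSketch()
slave.on_registered(now=0.0)
slave.on_ping(now=15.0)
slave.check_ping_timeout(now=80.0)  # 65s of silence: still registered
still_registered_at_80 = slave.registered
slave.check_ping_timeout(now=90.0)  # 75s of silence: re-registration
```

In both cases the slave only drops its own "registered" state and asks again;
the decision to accept or shut down the slave stays with the master.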

> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
>                 Key: MESOS-1529
>                 URL: https://issues.apache.org/jira/browse/MESOS-1529
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
>
> If a network partition occurs between a Master and Slave, the Master will 
> remove the Slave (as it fails health check) and mark the tasks being run 
> there as LOST. However, the Slave is not aware that it has been removed so 
> the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives 'exited' 
> event, indicating that the connection between the master and slave is not 
> closed).
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent 
> view of a network partition. We may still see this issue should a one-way 
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the 
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks 
> as potentially lost (zombie state) but maybe the Scheduler can make a more 
> intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
