Dominic Hamon created MESOS-1529:
------------------------------------

             Summary: Handle a network partition between Master and Slave
                 Key: MESOS-1529
                 URL: https://issues.apache.org/jira/browse/MESOS-1529
             Project: Mesos
          Issue Type: Bug
            Reporter: Dominic Hamon


If a network partition occurs between a Master and a Slave, the Master will
remove the Slave (because it fails its health check) and mark the tasks running
there as LOST. However, the Slave is not aware that it has been removed, so the
tasks will continue to run.

There are at least two possible approaches to solving this issue:

1. Introduce a health check from Slave to Master so that both sides have a
consistent view of a network partition. We may still see this issue if a
one-way connection error occurs.

2. Be less aggressive about marking tasks and Slaves as lost. Wait until the
Slave reappears and reconcile then. We'd still need to mark Slaves and tasks as
potentially lost (a zombie state), but the Scheduler may then be able to make a
more intelligent decision.
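
To make the two approaches concrete, here is a rough sketch in Python. It is
not Mesos code: MasterView, SlaveWatchdog, the UNREACHABLE state and the
timeout values are all hypothetical names invented for illustration. The
Master side parks a silent Slave in a "potentially lost" state instead of
removing it, and reconciles its task list when the Slave reappears (approach
2). The Slave side tracks when it last heard from the Master, so it can notice
the partition on its own (approach 1).

```python
from dataclasses import dataclass, field
from enum import Enum


class SlaveState(Enum):
    ACTIVE = "active"
    UNREACHABLE = "unreachable"  # hypothetical "zombie" state from approach 2


@dataclass
class MasterView:
    """Hypothetical Master-side bookkeeping; not the Mesos implementation."""
    timeout: float
    last_seen: dict = field(default_factory=dict)  # slave id -> last contact time
    state: dict = field(default_factory=dict)      # slave id -> SlaveState
    tasks: dict = field(default_factory=dict)      # slave id -> set of task ids

    def heartbeat(self, slave_id, now, running_tasks):
        """Record contact from a Slave and reconcile if it had gone silent.

        Returns the task ids the Master still believed were on the Slave but
        which the Slave no longer reports; only these are candidates for LOST.
        """
        was_unreachable = self.state.get(slave_id) is SlaveState.UNREACHABLE
        known = self.tasks.get(slave_id, set())
        running = set(running_tasks)
        self.last_seen[slave_id] = now
        self.state[slave_id] = SlaveState.ACTIVE
        self.tasks[slave_id] = running
        return known - running if was_unreachable else set()

    def check(self, now):
        """Mark silent Slaves as UNREACHABLE without dropping their tasks.

        Returns the ids of Slaves that changed state during this check.
        """
        newly_unreachable = []
        for slave_id, seen in self.last_seen.items():
            active = self.state[slave_id] is SlaveState.ACTIVE
            if active and now - seen > self.timeout:
                self.state[slave_id] = SlaveState.UNREACHABLE
                newly_unreachable.append(slave_id)
        return newly_unreachable


@dataclass
class SlaveWatchdog:
    """Approach 1: the Slave keeps its own view of Master liveness."""
    timeout: float
    last_master_contact: float = 0.0

    def master_contact(self, now):
        # Call whenever any message from the Master arrives.
        self.last_master_contact = now

    def partitioned(self, now):
        # True once the Master has been silent for longer than the timeout.
        return now - self.last_master_contact > self.timeout
```

In this sketch the Scheduler would be told about the UNREACHABLE transition
and could decide for itself whether to relaunch the affected tasks or wait.
What a Slave should do once partitioned() returns True (keep running, or kill
its tasks) is left open here, as it is in the description above.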



--
This message was sent by Atlassian JIRA
(v6.2#6252)
