Dominic Hamon created MESOS-1529:
------------------------------------
Summary: Handle a network partition between Master and Slave
Key: MESOS-1529
URL: https://issues.apache.org/jira/browse/MESOS-1529
Project: Mesos
Issue Type: Bug
Reporter: Dominic Hamon
If a network partition occurs between a Master and a Slave, the Master will
remove the Slave (because it fails its health checks) and mark the tasks running
on it as LOST. However, the Slave is not aware that it has been removed, so the
tasks will continue to run.
There are at least two possible approaches to solving this issue:
1. Introduce a health check from Slave to Master so they have a consistent view
of a network partition. We may still see this issue should a one-way connection
error occur.
2. Be less aggressive about marking tasks and Slaves as lost. Wait until the
Slave reappears and reconcile then. We'd still need to mark Slaves and tasks as
potentially lost (a zombie state), but with that information the Scheduler may
be able to make a more intelligent decision.
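
To make approach 2 concrete, here is a minimal sketch of the bookkeeping it
implies on the Master side. It is not Mesos code: MasterView, SlaveRecord, the
UNREACHABLE state, the POTENTIALLY_LOST status and the 75-second timeout are all
hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class SlaveState(Enum):
    REACHABLE = "reachable"
    UNREACHABLE = "unreachable"  # the "zombie" state from approach 2


@dataclass
class SlaveRecord:
    state: SlaveState = SlaveState.REACHABLE
    last_seen: float = 0.0
    tasks: dict = field(default_factory=dict)  # task id -> status string


class MasterView:
    """Hypothetical Master-side bookkeeping; not the Mesos implementation."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.slaves = {}

    def register(self, slave_id, task_ids, now):
        self.slaves[slave_id] = SlaveRecord(
            last_seen=now, tasks={t: "RUNNING" for t in task_ids})

    def health_check(self, now):
        """Mark silent Slaves UNREACHABLE instead of removing them."""
        for record in self.slaves.values():
            silent = now - record.last_seen > self.timeout_secs
            if silent and record.state is SlaveState.REACHABLE:
                record.state = SlaveState.UNREACHABLE
                for task_id in record.tasks:
                    record.tasks[task_id] = "POTENTIALLY_LOST"

    def reconcile(self, slave_id, running_task_ids, now):
        """On reappearance, trust the Slave's report of what still runs."""
        record = self.slaves[slave_id]
        record.state = SlaveState.REACHABLE
        record.last_seen = now
        for task_id in record.tasks:
            record.tasks[task_id] = (
                "RUNNING" if task_id in running_task_ids else "LOST")
        return dict(record.tasks)


# Usage: the Slave goes silent, is marked UNREACHABLE, then reappears
# and reports that only t1 survived the partition.
master = MasterView(timeout_secs=75)  # timeout value is an assumption
master.register("slave-1", ["t1", "t2"], now=0)
master.health_check(now=100)
statuses = master.reconcile("slave-1", {"t1"}, now=120)
```

In this sketch a task only becomes LOST once the Slave has reappeared and
failed to report it, so the Scheduler sees POTENTIALLY_LOST first and can
choose whether to wait or relaunch. A Slave that never reappears would still
need a separate, longer timeout, which the sketch does not cover.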