Currently, Mesos implements a hardcoded policy for handling
partitioned agents and tasks:

* agents are deemed to be partitioned when they fail health checks
for long enough (~75 seconds by default)
* partitioned agents are removed from the cluster. Frameworks receive
TASK_LOST for all tasks running on the removed agent.
* when the agent reconnects, the master instructs it to shut down and
terminate all of its tasks.
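
To make the current behavior concrete, here is a rough toy model of the
three steps above. It is only a sketch, not Mesos code: ToyMaster and its
methods are hypothetical names, and the two constants assume the ~75 second
default comes from 5 missed pings at 15 seconds each.

```python
from dataclasses import dataclass, field

# Assumption: the ~75 s default is 5 consecutive missed pings of 15 s each.
PING_TIMEOUT_SECS = 15
MAX_PING_TIMEOUTS = 5


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master's current partition policy."""
    tasks_by_agent: dict = field(default_factory=dict)  # agent id -> task ids
    missed_pings: dict = field(default_factory=dict)    # agent id -> count
    removed: set = field(default_factory=set)           # removed agent ids

    def register(self, agent_id, task_ids):
        self.tasks_by_agent[agent_id] = list(task_ids)
        self.missed_pings[agent_id] = 0

    def ping_timed_out(self, agent_id):
        """Record one failed health check; return status updates to send."""
        if agent_id not in self.tasks_by_agent:
            return []
        self.missed_pings[agent_id] += 1
        if self.missed_pings[agent_id] < MAX_PING_TIMEOUTS:
            return []
        # Agent is deemed partitioned: remove it and report its tasks lost.
        tasks = self.tasks_by_agent.pop(agent_id)
        del self.missed_pings[agent_id]
        self.removed.add(agent_id)
        return [(task_id, "TASK_LOST") for task_id in tasks]

    def reregister(self, agent_id):
        """A previously removed agent is told to shut down on reconnect."""
        return "SHUTDOWN" if agent_id in self.removed else "OK"
```

The point of the sketch is that every decision here is fixed inside the
master; a framework only ever sees the resulting TASK_LOST updates.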

This is problematic: framework authors would like to implement their
own partition-handling logic. To improve this situation, this design
doc proposes changing how the Mesos master handles partitions:

https://issues.apache.org/jira/browse/MESOS-5659

Feedback is very welcome!

Thanks,
Neil
