Currently, Mesos implements a hardcoded policy for handling partitioned agents and tasks:
* agents are deemed to be partitioned when they fail health checks (~75 seconds by default)
* partitioned agents are removed from the cluster. Frameworks receive TASK_LOST for all tasks running on the removed agent.
* when the agent reconnects, the master instructs it to shut down and terminate all of its tasks.

This is problematic: framework authors would like to implement their own partition-handling logic. To improve this situation, this design doc proposes changing how the Mesos master handles partitions:

https://issues.apache.org/jira/browse/MESOS-5659

Feedback is very welcome!

Thanks,
Neil
