[
https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Rukletsov updated MESOS-7911:
---------------------------------------
Labels: reliability (was: )
> Non-checkpointing framework's tasks should not be marked LOST when agent
> disconnects.
> -------------------------------------------------------------------------------------
>
> Key: MESOS-7911
> URL: https://issues.apache.org/jira/browse/MESOS-7911
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Priority: Critical
> Labels: reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on
> an agent and that agent disconnects from the master, the master marks those
> tasks LOST and removes them from its memory. The assumption is that the agent
> disconnected because it terminated.
> However, it's possible that this disconnection occurred due to a transient
> loss of connectivity and the agent re-connects while never having terminated.
> This case violates our assumption that the agent cannot have tasks unknown
> to the master:
> ```
> void Master::reconcileKnownSlave(
>     Slave* slave,
>     const vector<ExecutorInfo>& executors,
>     const vector<Task>& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not
> know about them!
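
The check that the TODO asks for could look roughly like the sketch below. It is a standalone illustration, not Mesos code: `findUnknownTasks` is a hypothetical helper, and task IDs are plain strings here rather than Mesos `TaskID` messages.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of Mesos): returns the IDs of tasks that a
// re-registering agent reports but that the master no longer tracks.
// Each returned ID is a violation of the "no unknown tasks" assumption.
std::vector<std::string> findUnknownTasks(
    const std::set<std::string>& masterTasks,
    const std::vector<std::string>& agentTasks)
{
  std::vector<std::string> unknown;
  for (const std::string& taskId : agentTasks) {
    if (masterTasks.count(taskId) == 0) {
      unknown.push_back(taskId);
    }
  }
  return unknown;
}
```

A caller would log an error for every ID returned, which makes the scenario described above visible instead of silent.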
> A more appropriate action here would be:
> (1) When an agent disconnects, mark the tasks as unreachable.
>   (a) If the framework is not partition aware, only show it the last known
>       task state.
>   (b) If the framework is partition aware, let it know that the task is now
>       unreachable.
> (2) If the agent re-connects:
>   (a) If the agent had restarted, let the non-checkpointing framework know
>       its tasks are GONE/LOST.
>   (b) If the agent still holds the tasks, restore the tasks as reachable.
> (3) If the agent gets removed:
>   (a) For partition-aware non-checkpointing frameworks, let them know the
>       tasks are unreachable.
>   (b) For non-partition-aware non-checkpointing frameworks, let them know the
>       tasks are lost and kill the tasks if the agent comes back.
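
The proposal above can be read as a small decision table. The following is a minimal sketch of that table, not an implementation against the Mesos master: `TaskState`, `AgentEvent`, `Decision` and `decide` are hypothetical names, GONE/LOST is collapsed into a single `LOST` value, and case (2b) is assumed to send no update because the description does not say whether one is sent.

```cpp
// Simplified, hypothetical task states (GONE/LOST collapsed into LOST).
enum class TaskState { RUNNING, UNREACHABLE, LOST };

// Hypothetical events corresponding to the numbered cases in the proposal.
enum class AgentEvent {
  DISCONNECTED,            // (1)
  RECONNECTED_RESTARTED,   // (2a)
  RECONNECTED_WITH_TASKS,  // (2b)
  REMOVED                  // (3)
};

struct Decision {
  TaskState masterState;    // state the master records for the task
  bool notifyFramework;     // whether the framework is sent an update
  TaskState reportedState;  // state the framework sees
  bool killOnReturn;        // kill the task if the agent comes back
};

Decision decide(AgentEvent event, bool partitionAware, TaskState lastKnown)
{
  switch (event) {
    case AgentEvent::DISCONNECTED:
      // (1b) partition aware: told UNREACHABLE.
      // (1a) otherwise: keeps seeing the last known state.
      return partitionAware
        ? Decision{TaskState::UNREACHABLE, true, TaskState::UNREACHABLE, false}
        : Decision{TaskState::UNREACHABLE, false, lastKnown, false};
    case AgentEvent::RECONNECTED_RESTARTED:
      // (2a) the agent restarted, so the tasks are gone.
      return Decision{TaskState::LOST, true, TaskState::LOST, false};
    case AgentEvent::RECONNECTED_WITH_TASKS:
      // (2b) tasks are restored as reachable (assumption: no update sent).
      return Decision{lastKnown, false, lastKnown, false};
    case AgentEvent::REMOVED:
      // (3a) partition aware: UNREACHABLE. (3b) otherwise: LOST, kill later.
      return partitionAware
        ? Decision{TaskState::UNREACHABLE, true, TaskState::UNREACHABLE, false}
        : Decision{TaskState::LOST, true, TaskState::LOST, true};
  }
  // Not reached for valid enum values; keeps compilers from warning.
  return Decision{lastKnown, false, lastKnown, false};
}
```

For example, `decide(AgentEvent::DISCONNECTED, false, TaskState::RUNNING)` records the task as unreachable in the master while the framework continues to see `RUNNING`, which is the behaviour case (1a) asks for.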