[
https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinod Kone reassigned MESOS-7911:
---------------------------------
Shepherd: Vinod Kone
Assignee: Benno Evers
Story Points: 5
> Non-checkpointing framework's tasks should not be marked LOST when agent
> disconnects.
> -------------------------------------------------------------------------------------
>
> Key: MESOS-7911
> URL: https://issues.apache.org/jira/browse/MESOS-7911
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Assignee: Benno Evers
> Priority: Critical
> Labels: reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on an
> agent and that agent disconnects from the master, the master will mark those
> tasks LOST and remove them from its memory. The assumption is that the agent
> is disconnecting because it terminated.
> However, the disconnection may instead be caused by a transient loss of
> connectivity, in which case the agent reconnects without ever having terminated.
> This case violates our assumption that the agent has no tasks unknown to the
> master:
> ```
> void Master::reconcileKnownSlave(
> Slave* slave,
> const vector<ExecutorInfo>& executors,
> const vector<Task>& tasks)
> {
> ...
> // TODO(bmahler): There's an implicit assumption here the slave
> // cannot have tasks unknown to the master. This _should_ be the
> // case since the causal relationship is:
> // slave removes task -> master removes task
> // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not
> know about them!
> A more appropriate action here would be:
> # When an agent disconnects, mark the tasks as unreachable.
> ## If the framework is not partition aware, only show it the last known task
> state.
> ## If the framework is partition aware, let it know that the task is now unreachable.
> # If the agent re-connects:
> ## If the agent had restarted, let the non-checkpointing framework know its
> tasks are GONE/LOST.
> ## If the agent still holds the tasks, the tasks are restored as reachable.
> # If the agent gets removed:
> ## For partition-aware non-checkpointing frameworks, let them know the tasks
> are unreachable.
> ## For non-partition-aware non-checkpointing frameworks, let them know the
> tasks are lost and kill them if the agent comes back.
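
The handling proposed in the quoted description could be sketched roughly as follows. This is only an illustration of the three cases (disconnect, re-connect, removal) for a non-checkpointing framework; `TaskStatus`, `Framework`, `TaskView`, and the three `onAgent...` functions are hypothetical names invented for this sketch and are not part of the Mesos code base.

```cpp
#include <cassert>

// Hypothetical, simplified types for illustration only (not Mesos APIs).
enum class TaskStatus { Running, Unreachable, Lost };

struct Framework {
  bool checkpointing;
  bool partitionAware;
};

// What the master tracks for one task, and what the framework was last told.
struct TaskView {
  TaskStatus lastKnown = TaskStatus::Running;  // last state reported by the agent
  bool reachable = true;                       // master-side reachability flag
  TaskStatus reported = TaskStatus::Running;   // state shown to the framework
  bool killIfAgentReturns = false;
};

// Agent disconnects: keep the task in master memory, flagged unreachable.
void onAgentDisconnected(const Framework& f, TaskView& t) {
  assert(!f.checkpointing);  // the sketch covers non-checkpointing frameworks only
  t.reachable = false;
  // Partition-aware frameworks are told; others keep seeing the last known state.
  t.reported = f.partitionAware ? TaskStatus::Unreachable : t.lastKnown;
}

// Agent re-connects, either after a restart or with its tasks still intact.
void onAgentReconnected(const Framework& f, TaskView& t, bool agentRestarted) {
  assert(!f.checkpointing);
  t.reachable = true;
  if (agentRestarted) {
    // Without checkpointing the tasks did not survive the restart.
    t.lastKnown = TaskStatus::Lost;
  }
  t.reported = t.lastKnown;
}

// Agent is removed by the master.
void onAgentRemoved(const Framework& f, TaskView& t) {
  assert(!f.checkpointing);
  t.reachable = false;
  if (f.partitionAware) {
    t.reported = TaskStatus::Unreachable;
  } else {
    t.reported = TaskStatus::Lost;
    t.killIfAgentReturns = true;  // kill the task if the agent comes back
  }
}
```

The sketch collapses GONE/LOST into a single `Lost` value and models "restored as reachable" by re-reporting the last state the agent had sent; a real implementation would need to distinguish these.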