[ 
https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-7911:
--------------------------------

    Assignee:     (was: Benno Evers)

> Non-checkpointing framework's tasks should not be marked LOST when agent 
> disconnects.
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-7911
>                 URL: https://issues.apache.org/jira/browse/MESOS-7911
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Critical
>              Labels: foundations, reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on 
> an agent and that agent disconnects from the master, the master marks those 
> tasks LOST and removes them from its memory. The assumption is that the agent 
> disconnected because it terminated.
> However, the disconnection may be caused by a transient loss of 
> connectivity, in which case the agent re-connects without ever having 
> terminated. This case violates our assumption that the agent holds no tasks 
> unknown to the master:
> ```
> void Master::reconcileKnownSlave(
>     Slave* slave,
>     const vector<ExecutorInfo>& executors,
>     const vector<Task>& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not 
> know about them!
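
The violated assumption can be made concrete with a small sketch. This is a hedged illustration only: `findUnknownTasks` is a hypothetical helper, not a Mesos function, and task IDs are modeled as plain strings rather than Mesos `TaskID` messages. It lists the tasks a re-registering agent reports that the master no longer tracks, which is the condition the TODO asks to log.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of Mesos): returns the IDs of tasks that
// the re-registering agent reports but the master no longer tracks.
// Task IDs are plain strings here purely for illustration.
std::vector<std::string> findUnknownTasks(
    const std::set<std::string>& masterTaskIds,
    const std::set<std::string>& agentTaskIds)
{
  std::vector<std::string> unknown;

  // std::set iterates in sorted order, which std::set_difference requires.
  std::set_difference(
      agentTaskIds.begin(), agentTaskIds.end(),
      masterTaskIds.begin(), masterTaskIds.end(),
      std::back_inserter(unknown));

  return unknown;
}
```

A non-empty result corresponds to the situation described above: tasks that are still on the agent but were already dropped by the master.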
> A more appropriate action here would be:
> # When an agent disconnects, mark the tasks as unreachable.
> ## If the framework is not partition aware, only show it the last known task 
> state.
> ## If the framework is partition aware, let it know that the tasks are now 
> unreachable.
> # If the agent re-connects:
> ## If the agent had restarted, let the non-checkpointing framework know its 
> tasks are GONE/LOST.
> ## If the agent still holds the tasks, the tasks are restored as reachable.
> # If the agent gets removed:
> ## For partition aware non-checkpointing frameworks, let them know the tasks 
> are unreachable.
> ## For non partition aware non-checkpointing frameworks, let them know the 
> tasks are lost and kill them if the agent comes back.
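
The proposal above can be read as a small decision table over agent events. The following is a minimal sketch of that reading, not the actual master logic: `AgentEvent`, `FrameworkView`, and `stateToReport` are hypothetical names, and the state strings only borrow the Mesos `TaskState` names. Where the description says "GONE/LOST" without saying which framework gets which, the sketch assumes partition-aware frameworks see `TASK_GONE` and the others see `TASK_LOST`.

```cpp
#include <string>

// Hypothetical event type mirroring the cases listed in the proposal.
enum class AgentEvent {
  DISCONNECTED,              // Agent lost its connection to the master.
  RECONNECTED_WITH_TASKS,    // Agent came back and still holds the tasks.
  RECONNECTED_AFTER_RESTART, // Agent came back after a restart.
  REMOVED                    // Master removed the agent.
};

// Hypothetical, minimal view of a non-checkpointing framework.
struct FrameworkView {
  bool partitionAware;
};

// Returns the task state shown to a non-checkpointing framework.
// `lastKnownState` is the state the master last recorded for the task.
std::string stateToReport(
    AgentEvent event,
    const FrameworkView& framework,
    const std::string& lastKnownState)
{
  switch (event) {
    case AgentEvent::DISCONNECTED:
      // Only partition-aware frameworks are told the task is unreachable.
      if (framework.partitionAware) {
        return "TASK_UNREACHABLE";
      }
      return lastKnownState;

    case AgentEvent::RECONNECTED_WITH_TASKS:
      // The task is reachable again, so its last known state is restored.
      return lastKnownState;

    case AgentEvent::RECONNECTED_AFTER_RESTART:
      // Assumption: GONE for partition-aware frameworks, LOST otherwise.
      return framework.partitionAware ? "TASK_GONE" : "TASK_LOST";

    case AgentEvent::REMOVED:
      return framework.partitionAware ? "TASK_UNREACHABLE" : "TASK_LOST";
  }

  return lastKnownState; // Not reached for valid enum values.
}
```

The sketch covers only what the framework is told. The follow-up action for non partition aware frameworks (killing the tasks if a removed agent comes back) is left out.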



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
