[ 
https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7911:
------------------------------
    Sprint: Mesosphere Sprint 74

> Non-checkpointing framework's tasks should not be marked LOST when agent 
> disconnects.
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-7911
>                 URL: https://issues.apache.org/jira/browse/MESOS-7911
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Critical
>              Labels: reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on 
> an agent and that agent disconnects from the master, the master marks those 
> tasks LOST and removes them from its memory. The assumption is that the agent 
> disconnected because it terminated.
> However, the disconnection may instead be due to a transient loss of 
> connectivity, in which case the agent re-connects without ever having 
> terminated. This case violates our assumption that the agent cannot have 
> tasks unknown to the master:
> ```
> void Master::reconcileKnownSlave(
>     Slave* slave,
>     const vector<ExecutorInfo>& executors,
>     const vector<Task>& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not 
> know about them!
> A more appropriate action here would be:
> (1) When an agent disconnects, mark the tasks as unreachable.
>   (a) If the framework is not partition aware, only show it the last known 
> task state.
>   (b) If the framework is partition aware, let it know that the tasks are 
> now unreachable.
> (2) If the agent re-connects:
>   (a) If the agent had restarted, let the non-checkpointing framework know 
> its tasks are GONE/LOST.
>   (b) If the agent still holds the tasks, the tasks are restored as reachable.
> (3) If the agent gets removed:
>   (a) For partition-aware non-checkpointing frameworks, let them know the 
> tasks are unreachable.
>   (b) For non-partition-aware non-checkpointing frameworks, let them know the 
> tasks are lost, and kill the tasks if the agent comes back.
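
The transitions in (1)-(3) above can be sketched as a small state model. This is only an illustration under stated assumptions, not Mesos code: `TaskView`, `Framework`, and the three `onAgent...` functions are hypothetical names, and reading "GONE/LOST" as GONE for partition-aware frameworks and LOST for the rest is an assumption. The case where the agent re-connects without a restart but no longer holds the task is not covered by the description, so the sketch leaves it untouched.

```cpp
#include <cassert>

// Hypothetical, simplified model of the proposal; none of these names
// are real Mesos types or functions.
enum class TaskState { RUNNING, UNREACHABLE, LOST, GONE };

struct Framework {
  bool checkpointing;
  bool partitionAware;
};

struct TaskView {
  TaskState lastKnown = TaskState::RUNNING;  // Last state the master recorded.
  bool reachable = true;                     // Master's view of reachability.
  TaskState reported = TaskState::RUNNING;   // What the framework was last told.
  bool killOnReturn = false;                 // Set in case (3b).
};

// (1) Agent disconnects: keep the task, but mark it unreachable.
TaskView onAgentDisconnected(const Framework& framework, TaskView task) {
  task.reachable = false;
  if (framework.partitionAware) {
    task.reported = TaskState::UNREACHABLE;  // (1b)
  }
  // (1a) Otherwise the framework keeps seeing the last known state.
  return task;
}

// (2) Agent re-connects after a disconnect.
TaskView onAgentReconnected(
    const Framework& framework,
    TaskView task,
    bool agentRestarted,
    bool agentHoldsTask) {
  assert(!task.reachable);  // Model assumption: a re-connect follows a disconnect.
  if (agentRestarted && !framework.checkpointing) {
    // (2a) Assumption: GONE for partition-aware frameworks, LOST otherwise.
    task.lastKnown =
        framework.partitionAware ? TaskState::GONE : TaskState::LOST;
    task.reported = task.lastKnown;
    task.reachable = true;
  } else if (agentHoldsTask) {
    // (2b) The task survived, so restore it as reachable.
    task.reachable = true;
    task.reported = task.lastKnown;
  }
  return task;
}

// (3) Agent is removed (non-checkpointing frameworks only in this sketch).
TaskView onAgentRemoved(const Framework& framework, TaskView task) {
  task.reachable = false;
  if (framework.partitionAware) {
    task.reported = TaskState::UNREACHABLE;  // (3a)
  } else {
    task.reported = TaskState::LOST;         // (3b)
    task.killOnReturn = true;                // Kill if the agent comes back.
  }
  return task;
}
```

In this model a transient disconnect followed by a re-connect leaves a non-partition-aware framework seeing RUNNING throughout, which is the behavior the description asks for in place of the current LOST-and-forget handling.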



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
