[ https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kone updated MESOS-7911:
------------------------------
    Sprint: Mesosphere Sprint 74

> Non-checkpointing framework's tasks should not be marked LOST when agent
> disconnects.
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-7911
>                 URL: https://issues.apache.org/jira/browse/MESOS-7911
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Critical
>              Labels: reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on
> an agent and that agent disconnects from the master, the master marks those
> tasks LOST and removes them from its memory. The assumption is that the agent
> disconnected because it terminated.
>
> However, the disconnection may instead be caused by a transient loss of
> connectivity, in which case the agent re-connects without ever having
> terminated. This case violates our assumption that the agent cannot have
> tasks unknown to the master:
> ```
> void Master::reconcileKnownSlave(
>     Slave* slave,
>     const vector<ExecutorInfo>& executors,
>     const vector<Task>& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not
> know about them!
>
> A more appropriate action here would be:
> (1) When an agent disconnects, mark the tasks as unreachable.
>   (a) If the framework is not partition aware, only show it the last known
>       task state.
>   (b) If the framework is partition aware, let it know that the tasks are
>       now unreachable.
> (2) If the agent re-connects:
>   (a) If the agent had restarted, let the non-checkpointing framework know
>       its tasks are GONE/LOST.
>   (b) If the agent still holds the tasks, restore the tasks as reachable.
> (3) If the agent gets removed:
>   (a) For partition aware non-checkpointing frameworks, let them know the
>       tasks are unreachable.
>   (b) For non partition aware non-checkpointing frameworks, let them know the
>       tasks are lost and kill them if the agent comes back.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
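
The proposed actions (1) to (3) in the description can be read as a small decision table. The following is a minimal sketch of that table, not Mesos code: `AgentEvent`, `Outcome` and `onAgentEvent` are hypothetical names introduced only for illustration. Two points are assumptions rather than statements from the issue: "GONE/LOST" in (2a) is read as TASK_GONE for partition aware frameworks and TASK_LOST otherwise, and in (2b) no new status update is sent when the tasks are restored as reachable.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for illustration only; none of these are Mesos APIs.
enum class AgentEvent {
  Disconnected,             // (1)  connection to the agent dropped
  ReconnectedAfterRestart,  // (2a) agent came back, but had restarted
  ReconnectedTasksIntact,   // (2b) agent came back still holding the tasks
  Removed                   // (3)  master removed the agent
};

struct Outcome {
  // Status update shown to the framework. An empty string means no update:
  // the framework keeps seeing the last known task state.
  std::string update;
  // Whether the task should be killed if the agent later comes back.
  bool killIfAgentReturns;
};

// Decision table for one task of a non-checkpointing framework.
Outcome onAgentEvent(AgentEvent event, bool partitionAware)
{
  switch (event) {
    case AgentEvent::Disconnected:
      // (1a) / (1b)
      return {partitionAware ? "TASK_UNREACHABLE" : "", false};
    case AgentEvent::ReconnectedAfterRestart:
      // (2a) Assumption: GONE for partition aware, LOST otherwise.
      return {partitionAware ? "TASK_GONE" : "TASK_LOST", false};
    case AgentEvent::ReconnectedTasksIntact:
      // (2b) Assumption: restoring reachability sends no new update.
      return {"", false};
    case AgentEvent::Removed:
      // (3a) / (3b)
      return partitionAware ? Outcome{"TASK_UNREACHABLE", false}
                            : Outcome{"TASK_LOST", true};
  }
  return {"", false};  // not reached for valid enum values
}
```

For example, `onAgentEvent(AgentEvent::Removed, false)` yields a TASK_LOST update with `killIfAgentReturns` set, which corresponds to case (3b) above.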