[ https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann reassigned MESOS-7911:
--------------------------------

    Assignee:     (was: Benno Evers)

> Non-checkpointing framework's tasks should not be marked LOST when agent
> disconnects.
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-7911
>                 URL: https://issues.apache.org/jira/browse/MESOS-7911
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Critical
>              Labels: foundations, reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on
> an agent and that agent disconnects from the master, the master will mark
> those tasks LOST and remove them from its memory. The assumption is that the
> agent is disconnecting because it terminated.
> However, it's possible that this disconnection occurred due to a transient
> loss of connectivity and the agent re-connects while never having terminated.
> This case violates our assumption of there being no tasks unknown to the
> master:
> ```
> void Master::reconcileKnownSlave(
>     Slave* slave,
>     const vector<ExecutorInfo>& executors,
>     const vector<Task>& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
> ```
> As a result, the tasks would remain on the agent but the master would not
> know about them!
> A more appropriate action here would be:
> # When an agent disconnects, mark the tasks as unreachable.
> ## If the framework is not partition aware, only show it the last known task
> state.
> ## If the framework is partition aware, let it know that it's now unreachable.
> # If the agent re-connects:
> ## And the agent had restarted, let the non-checkpointing framework know its
> tasks are GONE/LOST.
> ## If the agent still holds the tasks, the tasks are restored as reachable.
> # If the agent gets removed:
> ## For partition aware non-checkpointing frameworks, let them know the tasks
> are unreachable.
> ## For non partition aware non-checkpointing frameworks, let them know the
> tasks are lost and kill them if the agent comes back.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
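
For readers following the issue, the handling proposed in the description can be sketched as a small decision function. This is only an illustration under stated assumptions: `AgentEvent`, `Update` and `updateFor` are hypothetical names that do not exist in the Mesos codebase, and mapping "GONE/LOST" to TASK_GONE for partition aware frameworks and TASK_LOST for the others is a guess at the intent, not something the issue states.

```cpp
#include <string>

// Hypothetical, simplified model of the proposal. None of these types or
// names are taken from the Mesos codebase.
enum class AgentEvent {
  DISCONNECTED,             // Connection between agent and master was lost.
  RECONNECTED_WITH_TASKS,   // Agent came back and still holds the tasks.
  RECONNECTED_RESTARTED,    // Agent came back after having restarted.
  REMOVED                   // Master removed the agent.
};

struct Update {
  std::string state;        // Task state shown to the framework.
  bool killIfAgentReturns;  // Kill the task should the agent come back.
};

// Decides what a non-checkpointing framework is told about one task.
// `lastKnownState` is the state the master last recorded for that task.
Update updateFor(
    AgentEvent event,
    bool partitionAware,
    const std::string& lastKnownState)
{
  switch (event) {
    case AgentEvent::DISCONNECTED:
      // Task is kept as unreachable; only partition aware frameworks see that.
      return {partitionAware ? std::string("TASK_UNREACHABLE") : lastKnownState,
              false};

    case AgentEvent::RECONNECTED_WITH_TASKS:
      // Task is restored as reachable, so the last known state applies again.
      return {lastKnownState, false};

    case AgentEvent::RECONNECTED_RESTARTED:
      // Assumption: GONE for partition aware frameworks, LOST otherwise.
      return {std::string(partitionAware ? "TASK_GONE" : "TASK_LOST"), false};

    case AgentEvent::REMOVED:
      // Non partition aware frameworks are told LOST, and the task is
      // killed if the agent later comes back.
      return {std::string(partitionAware ? "TASK_UNREACHABLE" : "TASK_LOST"),
              !partitionAware};
  }

  return {lastKnownState, false};  // Not reached for valid enum values.
}
```

For example, `updateFor(AgentEvent::DISCONNECTED, false, "TASK_RUNNING")` keeps reporting TASK_RUNNING, which is the key difference from the current behaviour of sending LOST right away.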