Neil Conway created MESOS-6608:
----------------------------------
Summary: Do not transition tasks to TASK_KILLED on framework teardown
Key: MESOS-6608
URL: https://issues.apache.org/jira/browse/MESOS-6608
Project: Mesos
Issue Type: Bug
Components: master
Reporter: Neil Conway
When a framework is torn down or disconnects, we currently transition the
framework's tasks to state TASK_KILLED at the master. See
* https://reviews.apache.org/r/25250
* MESOS-1736
The transition is applied immediately; concurrently, the master sends a
{{ShutdownFrameworkMessage}} to each agent that is running one of the
framework's tasks.
Marking the task KILLED in this manner is problematic for two reasons:
# The task is still running and may continue running for an unbounded length of
time if the agent becomes partitioned.
# KILLED is usually used to denote tasks that are killed in response to a "kill
task" operation.
My primary concern here is #1. We could pick a different terminal state to
address #2 but I think that is secondary: transitioning the task to _any_
terminal state before it has been terminated is problematic, in my view.
Proposed behavior: when the framework teardown is applied, we keep the task in
its current state at the master. Then when the agent receives the
{{ShutdownFrameworkMessage}}, it can shut down the task and eventually respond
with a terminal status update. At that point we can transition the task into
the appropriate terminal state (whether it be KILLED, FAILED, GONE, or a new
state).
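To make the proposed ordering concrete, here is a minimal sketch in Python. It is a toy model, not Mesos code: {{ToyMaster}}, its methods, and the {{outbox}} list are hypothetical names invented for illustration. Only the message name {{ShutdownFrameworkMessage}} and the task state names come from this ticket. The point it shows is that teardown leaves the task state untouched, and the state changes only when a terminal status update arrives.

```python
from dataclasses import dataclass, field

# Terminal states named in this ticket; the real set in Mesos is larger.
TERMINAL_STATES = {"TASK_KILLED", "TASK_FAILED", "TASK_GONE"}


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master's task bookkeeping."""
    tasks: dict = field(default_factory=dict)      # task_id -> task record
    tearing_down: set = field(default_factory=set)  # framework ids
    outbox: list = field(default_factory=list)      # (agent_id, message, framework_id)

    def launch(self, task_id, framework_id, agent_id):
        self.tasks[task_id] = {
            "framework": framework_id,
            "agent": agent_id,
            "state": "TASK_RUNNING",
        }

    def teardown_framework(self, framework_id):
        # Proposed behavior: remember that the framework is going away,
        # but do NOT touch the state of its tasks.
        self.tearing_down.add(framework_id)
        agents = sorted(
            t["agent"] for t in self.tasks.values()
            if t["framework"] == framework_id
        )
        for agent_id in dict.fromkeys(agents):  # de-duplicate, keep order
            self.outbox.append((agent_id, "ShutdownFrameworkMessage", framework_id))

    def status_update(self, task_id, state):
        # The task becomes terminal only once the agent reports it.
        self.tasks[task_id]["state"] = state
        return state in TERMINAL_STATES


master = ToyMaster()
master.launch("t1", "f1", "agent-1")
master.teardown_framework("f1")
state_after_teardown = master.tasks["t1"]["state"]
became_terminal = master.status_update("t1", "TASK_KILLED")
state_after_update = master.tasks["t1"]["state"]
```

If the agent is partitioned and the update never arrives, the task simply stays in its last known state in this model, which matches concern #1 above.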
This will probably require some changes to the status update machinery, since
we currently drop status updates for terminating frameworks at the agent. Since
the scheduler is gone, we'd need to have the master ack the status update
rather than the framework.
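One way the acknowledgement change might look, again as a rough Python sketch rather than actual Mesos code: {{ToyAgent}}, {{AckingMaster}}, and every method on them are hypothetical. The sketch assumes the agent keeps an update pending until it is acknowledged (instead of dropping it for a terminating framework), and that the master acknowledges on behalf of a framework that has been torn down.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent that retains updates until they are acknowledged."""
    pending: dict = field(default_factory=dict)  # update_id -> (task_id, state)

    def send_update(self, update_id, task_id, state):
        # Assumed change: the update is kept and sent even if the
        # framework is terminating, rather than being dropped.
        self.pending[update_id] = (task_id, state)
        return (update_id, task_id, state)

    def acknowledge(self, update_id):
        self.pending.pop(update_id, None)


@dataclass
class AckingMaster:
    """Hypothetical master that acks updates for torn-down frameworks."""
    task_framework: dict = field(default_factory=dict)  # task_id -> framework_id
    torn_down: set = field(default_factory=set)
    task_state: dict = field(default_factory=dict)
    forwarded: list = field(default_factory=list)       # updates sent to schedulers

    def handle_update(self, agent, update):
        update_id, task_id, state = update
        self.task_state[task_id] = state
        if self.task_framework[task_id] in self.torn_down:
            # No scheduler is left to acknowledge, so the master does it.
            agent.acknowledge(update_id)
            return "acked-by-master"
        # Normal path: the scheduler is expected to acknowledge later.
        self.forwarded.append(update)
        return "forwarded-to-scheduler"


agent = ToyAgent()
master = AckingMaster(
    task_framework={"t1": "f1", "t2": "f2"},
    torn_down={"f1"},
)
result_torn_down = master.handle_update(
    agent, agent.send_update("u1", "t1", "TASK_KILLED"))
result_live = master.handle_update(
    agent, agent.send_update("u2", "t2", "TASK_FAILED"))
```

In this model the update for the live framework stays pending at the agent until its scheduler acknowledges it, while the update for the torn-down framework is cleared immediately by the master.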