[ https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922871#comment-16922871 ]
Meng Zhu commented on MESOS-9750:
---------------------------------

Note that while this ticket makes a completed task with a nonterminal status get listed in the right place (i.e. completed tasks), it results in weird behavior where a completed task has a nonterminal status, e.g. TASK_RUNNING.

> Agent V1 GET_STATE response may report a complete executor's tasks as
> non-terminal after a graceful agent shutdown
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9750
>                 URL: https://issues.apache.org/jira/browse/MESOS-9750
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.6.0, 1.7.0, 1.8.0
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>            Priority: Major
>              Labels: foundations
>             Fix For: 1.7.3, 1.8.1, 1.9.0
>
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on
> {{executor_shutdown_grace_period}}.
> 3) The executor exits, before all terminal status updates reach the agent.
> This is more likely if {{executor_shutdown_grace_period}} passes.
> This results in a completed executor, with non-terminal tasks (according to
> status updates).
> When the agent starts back up, the completed executor will be recovered and
> shows up correctly as a completed executor in {{/state}}. However, if you
> fetch the V1 {{GET_STATE}} result, there will be an entry in
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
>     name: "test-task"
>     task_id {
>       value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>     }
>     framework_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>     }
>     executor_id {
>       value: "default"
>     }
>     agent_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>     }
>     state: TASK_RUNNING
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     statuses {
>       task_id {
>         value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>       }
>       state: TASK_RUNNING
>       agent_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>       }
>       timestamp: 1556674758.2175469
>       executor_id {
>         value: "default"
>       }
>       source: SOURCE_EXECUTOR
>       uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>       container_status { ... }
>     }
>   }
> }
> get_executors {
>   completed_executors {
>     executor_info {
>       executor_id {
>         value: "default"
>       }
>       command {
>         value: ""
>       }
>       framework_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>     }
>   }
> }
> get_frameworks {
>   completed_frameworks {
>     framework_info {
>       user: "user"
>       name: "default"
>       id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>       checkpoint: true
>       hostname: "localhost"
>       principal: "test-principal"
>       capabilities {
>         type: MULTI_ROLE
>       }
>       capabilities {
>         type: RESERVATION_REFINEMENT
>       }
>       roles: "*"
>     }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when
> constructing the response. The terminal task(s) with non-terminal updates
> appear under completed executors.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
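
For readers who want the behavior in miniature, here is a toy model of the two listings discussed in this ticket. It is only a sketch and not the code in {{http.cpp}}: the {{Task}} and {{Executor}} classes, both function names, and the shortened list of terminal states are hypothetical and exist only for illustration. The first function merges all executors and buckets tasks purely by their latest status, which reproduces the stray {{launched_tasks}} entry. The second treats every task of a completed executor as completed, which lists the task in the right place but leaves its state as TASK_RUNNING, as noted in the comment above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative subset of Mesos terminal task states (assumption: not the full list).
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED"}


@dataclass
class Task:  # hypothetical stand-in for a task entry in the response
    task_id: str
    state: str  # latest status update the agent received for this task


@dataclass
class Executor:  # hypothetical stand-in for a (possibly completed) executor
    executor_id: str
    completed: bool
    tasks: List[Task] = field(default_factory=list)


def bucket_by_status(executors: List[Executor]) -> Dict[str, List[Task]]:
    """Merge all executors and bucket tasks only by their latest status."""
    result: Dict[str, List[Task]] = {"launched_tasks": [], "completed_tasks": []}
    for executor in executors:
        for task in executor.tasks:
            terminal = task.state in TERMINAL_STATES
            result["completed_tasks" if terminal else "launched_tasks"].append(task)
    return result


def bucket_by_executor(executors: List[Executor]) -> Dict[str, List[Task]]:
    """Treat every task of a completed executor as completed, whatever its status."""
    result: Dict[str, List[Task]] = {"launched_tasks": [], "completed_tasks": []}
    for executor in executors:
        for task in executor.tasks:
            done = executor.completed or task.state in TERMINAL_STATES
            result["completed_tasks" if done else "launched_tasks"].append(task)
    return result


# The scenario from the ticket: a completed executor whose only task never
# delivered a terminal status update to the agent.
executors = [
    Executor(
        "default",
        completed=True,
        tasks=[Task("dff5a155-47f1-4a71-9b92-30ca059ab456", "TASK_RUNNING")],
    )
]
before = bucket_by_status(executors)
after = bucket_by_executor(executors)
print(len(before["launched_tasks"]), len(after["completed_tasks"]))
print(after["completed_tasks"][0].state)
```

In this model the task moves from {{launched_tasks}} to {{completed_tasks}}, but its reported state is unchanged, so a consumer that reads only the state field would still see a running task.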