[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922871#comment-16922871
 ] 

Meng Zhu commented on MESOS-9750:
---------------------------------

Note that while this ticket lists the completed task with a nonterminal status 
in the right place (i.e. under completed tasks), it results in the odd 
behavior of a completed task reporting a nonterminal status, e.g. 
TASK_RUNNING.
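
A consumer of the response might want to detect such tasks. The following is a minimal sketch, not part of the fix: it assumes the GET_STATE result has already been parsed into a dict whose keys mirror the protobuf field names (get_tasks, completed_tasks, task_id, state). The helper name, the sample data, and the set of terminal states are illustrative assumptions.

```python
# Terminal task states, assumed here for illustration; check the Mesos
# protobuf definitions for the authoritative list.
TERMINAL_STATES = {
    "TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_ERROR",
    "TASK_LOST", "TASK_DROPPED", "TASK_GONE",
}


def stale_completed_tasks(get_state):
    """Return IDs of tasks listed as completed whose last state is nonterminal.

    `get_state` is assumed to be the parsed `get_state` part of the response.
    """
    tasks = get_state.get("get_tasks", {}).get("completed_tasks", [])
    return [
        task["task_id"]["value"]
        for task in tasks
        if task.get("state") not in TERMINAL_STATES
    ]


# Hand-written sample: the first task reuses the ID from the ticket,
# the second one is hypothetical.
sample = {
    "get_tasks": {
        "completed_tasks": [
            {"task_id": {"value": "dff5a155-47f1-4a71-9b92-30ca059ab456"},
             "state": "TASK_RUNNING"},
            {"task_id": {"value": "hypothetical-finished-task"},
             "state": "TASK_FINISHED"},
        ]
    }
}

print(stale_completed_tasks(sample))
```

A task reported this way belongs to an executor that has already exited, so a client would likely treat it as no longer running despite the TASK_RUNNING state.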

> Agent V1 GET_STATE response may report a complete executor's tasks as 
> non-terminal after a graceful agent shutdown
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9750
>                 URL: https://issues.apache.org/jira/browse/MESOS-9750
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.6.0, 1.7.0, 1.8.0
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>            Priority: Major
>              Labels: foundations
>             Fix For: 1.7.3, 1.8.1, 1.9.0
>
>
> The problem occurs after the following steps:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on 
> {{executor_shutdown_grace_period}}.
> 3) The executor exits before all terminal status updates reach the agent. 
> This is more likely if {{executor_shutdown_grace_period}} passes.
> This results in a completed executor with non-terminal tasks (according to 
> status updates).
> When the agent starts back up, the completed executor is recovered and 
> shows up correctly as a completed executor in {{/state}}. However, if you 
> fetch the V1 {{GET_STATE}} result, there will be an entry in 
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
>     name: "test-task"
>     task_id {
>       value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>     }
>     framework_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>     }
>     executor_id {
>       value: "default"
>     }
>     agent_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>     }
>     state: TASK_RUNNING
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     statuses {
>       task_id {
>         value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>       }
>       state: TASK_RUNNING
>       agent_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>       }
>       timestamp: 1556674758.2175469
>       executor_id {
>         value: "default"
>       }
>       source: SOURCE_EXECUTOR
>       uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>       container_status { ... }
>     }
>   }
> }
> get_executors {
>   completed_executors {
>     executor_info {
>       executor_id {
>         value: "default"
>       }
>       command {
>         value: ""
>       }
>       framework_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>     }
>   }
> }
> get_frameworks {
>   completed_frameworks {
>     framework_info {
>       user: "user"
>       name: "default"
>       id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>       checkpoint: true
>       hostname: "localhost"
>       principal: "test-principal"
>       capabilities {
>         type: MULTI_ROLE
>       }
>       capabilities {
>         type: RESERVATION_REFINEMENT
>       }
>       roles: "*"
>     }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when 
> constructing the response.  The terminal task(s) with non-terminal updates 
> appear under completed executors.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756
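
The effect described in the ticket can be illustrated with a simplified model. This is a sketch only and not the Mesos code in http.cpp: the Task and Executor classes and both functions are hypothetical. It contrasts bucketing tasks purely by their last status while iterating over a merged executor list with bucketing that also takes into account whether the owning executor has completed.

```python
from dataclasses import dataclass, field

# Terminal task states, assumed here for illustration.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


@dataclass
class Task:
    task_id: str
    state: str  # last state seen in a status update


@dataclass
class Executor:
    executor_id: str
    tasks: list = field(default_factory=list)


def classify_by_state(executors, completed_executors):
    """Merge both executor lists and bucket tasks by last state only."""
    launched, completed = [], []
    for executor in executors + completed_executors:
        for task in executor.tasks:
            bucket = completed if task.state in TERMINAL_STATES else launched
            bucket.append(task.task_id)
    return launched, completed


def classify_by_executor(executors, completed_executors):
    """Treat every task of a completed executor as completed."""
    launched, completed = [], []
    for executor in executors:
        for task in executor.tasks:
            bucket = completed if task.state in TERMINAL_STATES else launched
            bucket.append(task.task_id)
    for executor in completed_executors:
        # The executor is gone, so none of its tasks can still be running,
        # even if the terminal status update never reached the agent.
        completed.extend(task.task_id for task in executor.tasks)
    return launched, completed


# The executor exited before the terminal update for its task arrived.
done = Executor("default", [Task("test-task", "TASK_RUNNING")])

print(classify_by_state([], [done]))     # task lands in the launched bucket
print(classify_by_executor([], [done]))  # task lands in the completed bucket
```

In the second variant the task is listed in the right place but still carries TASK_RUNNING as its state, which is the leftover oddity pointed out in the comment above.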



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
