[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839373#comment-16839373
 ] 

Joseph Wu commented on MESOS-9750:
----------------------------------

Found one more code path where the agent's {{GET_STATE}} response will return 
extraneous {{launched_tasks}}.

This happens when a Framework or Master {{TEARDOWN}} call is used and the 
executor does not send a terminal status update in time.  Unlike the graceful 
shutdown case, this one does not require an agent restart/shutdown.
This code path also leaves the executor's checkpointed state looking 
identical to the agent shutdown case, so if the agent is later restarted, the 
code in the above patch will run and put the agent back into a consistent state.

Fix and test here: https://reviews.apache.org/r/70641/

> Agent V1 GET_STATE response may report a complete executor's tasks as 
> non-terminal after a graceful agent shutdown
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9750
>                 URL: https://issues.apache.org/jira/browse/MESOS-9750
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.6.0, 1.7.0, 1.8.0
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>            Priority: Major
>              Labels: foundations
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on 
> {{executor_shutdown_grace_period}}.
> 3) The executor exits before all terminal status updates reach the agent. 
> This is more likely if {{executor_shutdown_grace_period}} elapses.
> This results in a completed executor with non-terminal tasks (according to 
> status updates).
> When the agent starts back up, the completed executor will be recovered and 
> will show up correctly as a completed executor in {{/state}}.  However, if you 
> fetch the V1 {{GET_STATE}} result, there will be an entry in 
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
>     name: "test-task"
>     task_id {
>       value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>     }
>     framework_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>     }
>     executor_id {
>       value: "default"
>     }
>     agent_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>     }
>     state: TASK_RUNNING
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     statuses {
>       task_id {
>         value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>       }
>       state: TASK_RUNNING
>       agent_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>       }
>       timestamp: 1556674758.2175469
>       executor_id {
>         value: "default"
>       }
>       source: SOURCE_EXECUTOR
>       uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>       container_status { ... }
>     }
>   }
> }
> get_executors {
>   completed_executors {
>     executor_info {
>       executor_id {
>         value: "default"
>       }
>       command {
>         value: ""
>       }
>       framework_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>     }
>   }
> }
> get_frameworks {
>   completed_frameworks {
>     framework_info {
>       user: "user"
>       name: "default"
>       id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-0000"
>       }
>       checkpoint: true
>       hostname: "localhost"
>       principal: "test-principal"
>       capabilities {
>         type: MULTI_ROLE
>       }
>       capabilities {
>         type: RESERVATION_REFINEMENT
>       }
>       roles: "*"
>     }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when 
> constructing the response.  The task(s) that have terminated but whose latest 
> status update is non-terminal appear under completed executors, and so are 
> reported as launched.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756
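
The effect described above can be illustrated with a small toy model. This is 
only a sketch in Python, not the Mesos C++ code and not the fix in the linked 
review: the {{Task}} and {{Executor}} classes, the {{collect_launched_tasks}} 
function, and the {{include_completed}} flag are all hypothetical names made up 
for illustration. It assumes a completed executor still holds a task for which 
no terminal status update was ever recorded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    task_id: str
    state: str  # state from the latest status update the agent received


@dataclass
class Executor:
    executor_id: str
    # Tasks for which the agent never recorded a terminal status update.
    launched_tasks: List[Task] = field(default_factory=list)


def collect_launched_tasks(executors, completed_executors, include_completed=True):
    """Gather launched_tasks entries for a toy GET_STATE response."""
    pool = list(executors)
    if include_completed:
        # Combining both lists is what lets a stale task leak through.
        pool.extend(completed_executors)
    return [task for executor in pool for task in executor.launched_tasks]


# The executor has exited, but its task never got a terminal update.
completed = Executor("default", [Task("test-task", "TASK_RUNNING")])

combined = collect_launched_tasks([], [completed])
filtered = collect_launched_tasks([], [completed], include_completed=False)

print([task.task_id for task in combined])  # ['test-task']
print([task.task_id for task in filtered])  # []
```

In this model, the combined view reports the task as launched even though its 
executor is complete, which matches the symptom in the response above. Skipping 
completed executors is just one way to show the difference; the actual patch 
may take a different approach.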



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
