[
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323279#comment-16323279
]
Vinod Kone edited comment on MESOS-8125 at 1/12/18 12:22 AM:
-------------------------------------------------------------
Did some investigation with [~xujyan] and [~megha.sharma].
Findings:
This issue doesn't seem to affect mesos containerizer because of the way
launcher destroy short circuits when the cgroup doesn't exist.
In the docker containerizer, we don't realize that a checkpointed pid is stale when the
corresponding docker container has exited. We should fix `_recover` to do
`container->status.set(None())` when `container->pid` is `None()`.
Also looks like docker containerizer doesn't recover the executor pid!?
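The proposed `_recover` fix can be sketched as follows. This is a minimal, hypothetical model, not the real Mesos code: `StatusFuture`, `Container`, `reap`, and `recoverContainer` are simplified stand-ins for Mesos's `Future<Option<int>>`, the containerizer's container struct, the libprocess reaper, and the `_recover` continuation.

```cpp
#include <optional>

// Hypothetical simplified model of the docker containerizer's recovery
// path. Future<Option<int>> is collapsed into a settled/unsettled flag
// plus an optional exit code.
struct StatusFuture {
  bool settled = false;         // has the future been set?
  std::optional<int> exitCode;  // empty == exited, exit code unknown (None())
};

struct Container {
  std::optional<int> pid;       // checkpointed forked pid, if any
  StatusFuture status;
};

// Stand-in for the reaper: in real Mesos this waits for `pid` to exit.
// On a reused pid it would end up watching the wrong process.
StatusFuture reap(int pid) {
  return StatusFuture{};        // still pending
}

void recoverContainer(Container& c) {
  if (!c.pid) {
    // Proposed fix: the container has already exited, so settle the
    // status with None() instead of reaping a possibly reused pid.
    c.status = StatusFuture{true, std::nullopt};
  } else {
    c.status = reap(*c.pid);
  }
}
```

The key point is the first branch: when no valid pid was recovered, the status is settled immediately rather than handed to the reaper, so the agent never waits on an unrelated process that happens to have the old pid.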
was (Author: vinodkone):
Did some investigation with [~xujyan] and [~megha.sharma]
Findings:
--> This issue doesn't seem to affect mesos containerizer because of the way
launcher destroy short circuits when the cgroup doesn't exist.
--> In docker containerizer, we don't realize that a pid is invalid when the
corresponding docker container is exited. We should fix `_recover` to do
`container->status.set(None())` when the container->pid is None().
--> Also looks like docker containerizer doesn't recover the executor pid!?
> Agent should properly handle recovering an executor when its pid is reused
> --------------------------------------------------------------------------
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
> Issue Type: Bug
> Reporter: Gastón Kleiman
> Assignee: Megha Sharma
> Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process
> is assigned the same pid that the executor had before the reboot. In this
> case the agent will unsuccessfully try to reregister with the executor, and
> then transition it to a {{TERMINATING}} state. The executor will sadly get
> stuck in that state, and the tasks that it started will get stuck in whatever
> state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink
> under {{work_dir/meta/slaves/latest/frameworks/<framework
> id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta
> directory, e.g.,
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
> I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host
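Step 3 above can be sketched as follows. This uses a scratch directory in place of the real agent meta directory, and the framework/executor IDs are placeholders; on a live agent the file would sit under the checkpointed path shown in the description.

```shell
# Simulate overwriting the executor's checkpointed forked pid.
# PIDS stands in for .../meta/slaves/latest/frameworks/<framework
# id>/executors/<executor id>/runs/latest/pids on a real agent.
PIDS="$(mktemp -d)/runs/latest/pids"
mkdir -p "$PIDS"

# Write pid 2 (normally kthreadd) so that, after recovery, the agent
# tries to reregister with a pid that belongs to another process.
echo 2 > "$PIDS/forked.pid"
cat "$PIDS/forked.pid"
```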
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)