[
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341146#comment-16341146
]
Qian Zhang edited comment on MESOS-8125 at 1/29/18 6:45 AM:
------------------------------------------------------------
{quote}Also looks like docker containerizer doesn't recover the executor pid!?
{quote}
Yes, I have verified that after agent recovery `container->executorPid` is
`None()`. We should have set it in `DockerContainerizerProcess::_recover`.
{quote}We should fix `_recover` to do `container->status.set(None())` when the
container->pid is None().
{quote}
I think there are two cases that we need to handle (a sketch of the idea
follows this list):
# Docker container was stopped while the agent was down: In this case, when the
agent recovers, `container->pid` will be `None()` in
`DockerContainerizerProcess::_recover` (we can get such info from this method's
second parameter `_containers`), so we should do `container->status.set(None())`.
# Docker container was removed while the agent was down: In this case, when the
agent recovers, we will not find the relevant Docker container in `_containers`
in `DockerContainerizerProcess::_recover`, and we should do
`container->status.set(None())` as well.
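To make that concrete, here is a minimal sketch of the handling I have in mind
inside `DockerContainerizerProcess::_recover` (this is not the actual Mesos
code; the lookup helper and some member names are assumptions made for
illustration):
{code}
// Sketch only: iterate over the containers recovered from the checkpointed
// state and complete their status promise based on what the recovered
// Docker containers (the `_containers` parameter) tell us.
foreachpair (const ContainerID& containerId,
             Container* container,
             containers_) {
  // Hypothetical helper: find the Docker container belonging to this
  // executor, if it still exists on the host.
  Option<Docker::Container> dockerContainer =
    findDockerContainer(_containers, containerId);

  if (dockerContainer.isNone() || dockerContainer->pid.isNone()) {
    // Case 2 (container removed) or case 1 (container stopped): there is
    // no process left to reap, so complete the promise with `None()`
    // instead of leaving it pending forever.
    container->status.set(None());
  } else {
    // The Docker container is still running; reap its pid as usual.
    container->status.set(process::reap(dockerContainer->pid.get()));
  }

  // This is also where `container->executorPid` should be restored from
  // the checkpointed forked pid, which we currently do not do.
}
{code}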
But there is still a case that the above solution cannot handle: launch a task
using the Docker containerizer to run a Docker container with the
`--restart=always` option (it can be enabled with
`ContainerInfo.DockerInfo.parameters`; see the sketch after the list below).
If, after the agent node is rebooted, the pid of the Docker executor happens to
be reused by another process while the Docker container is still running due to
the `--restart=always` option, then this task can never be killed because:
# After the agent reboots, the reregister-executor timeout will be triggered
because the process reusing the pid will never register with the agent. As a
result, the agent will move the executor into the `TERMINATING` state and
destroy the container, but the destroy cannot complete because
`DockerContainerizerProcess::__destroy` will wait on
`container->status.future().get()` forever, since the process reusing the pid
will not be reaped even though the Docker container will be stopped by
`DockerContainerizerProcess::_destroy`. So the task will still be reported as
running and the executor will stay in the `TERMINATING` state, even though the
Docker container has already been stopped.
# If the framework issues a kill task after the agent node is rebooted, such a
request will be ignored by the agent because the executor is in the
`TERMINATING` state.
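For reference, here is a sketch of how `--restart=always` can be enabled
through `ContainerInfo.DockerInfo.parameters` (field names as in the public
`mesos.proto`; the image is arbitrary and only for illustration). Each
parameter is forwarded to `docker run` as a `--<key>=<value>` flag:
{code}
#include <mesos/mesos.pb.h>

mesos::ContainerInfo containerInfo;
containerInfo.set_type(mesos::ContainerInfo::DOCKER);

mesos::ContainerInfo::DockerInfo* docker = containerInfo.mutable_docker();
docker->set_image("alpine");  // arbitrary image, for illustration only

// Forwarded to `docker run` as `--restart=always`, so Docker itself keeps
// restarting the container across host reboots, independently of the
// Mesos executor.
mesos::Parameter* parameter = docker->add_parameters();
parameter->set_key("restart");
parameter->set_value("always");
{code}
This is what makes the scenario above possible: after the host reboot, Docker
restarts the container on its own, while the executor process that used to
supervise it is gone and its pid may be reused by an unrelated process.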
> Agent should properly handle recovering an executor when its pid is reused
> --------------------------------------------------------------------------
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
> Issue Type: Bug
> Reporter: Gastón Kleiman
> Assignee: Qian Zhang
> Priority: Critical
>
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta
> directory, e.g.,
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
> I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host