[ https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341146#comment-16341146 ]
Qian Zhang commented on MESOS-8125:
-----------------------------------

{quote}Also looks like docker containerizer doesn't recover the executor pid!?
{quote}
Yes, I have verified that after agent recovery the `container->executorPid` is `None()`. We should have set it in `DockerContainerizerProcess::_recover`.
{quote}We should fix `_recover` to do `container->status.set(None())` when the container->pid is None().
{quote}
I think there are two cases that we need to handle:
 # Docker container was stopped while the agent was down: In this case, when the agent recovers, the `container->pid` will be `None()` in `DockerContainerizerProcess::_recover` (we can get this information from the method's second parameter, `_containers`), and we should do `container->status.set(None())`.
 # Docker container was removed while the agent was down: In this case, when the agent recovers, we will not find the relevant Docker container in `_containers` in `DockerContainerizerProcess::_recover`, and we should do `container->status.set(None())` as well.

> Agent should properly handle recovering an executor when its pid is reused
> --------------------------------------------------------------------------
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
> Issue Type: Bug
> Reporter: Gastón Kleiman
> Priority: Critical
>
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta directory, e.g., {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}. I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
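
The two recovery cases listed in the comment can be sketched roughly as follows. This is a simplified, self-contained model and not the actual Mesos code: `RecoveredContainer`, `DockerInfo`, `Outcome`, and `recoverOne` are hypothetical names, `std::optional` (C++17) stands in for Mesos' `Option`, and a plain flag stands in for the `container->status` promise. It only illustrates when the status would be resolved to `None()`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the per-container state kept by the
// containerizer. The real Mesos type is different.
struct RecoveredContainer {
  std::optional<int> pid;     // models container->pid
  bool statusSet = false;     // true once the status has been resolved
  std::optional<int> status;  // nullopt models a status of None()
};

// Hypothetical stand-in for one entry of `_containers`, i.e. what
// Docker reports about a container after the agent restarts.
struct DockerInfo {
  std::optional<int> pid;     // nullopt when the container is stopped
};

enum class Outcome { StillRunning, Stopped, Removed };

// Models the decision that `_recover` would make for one container.
Outcome recoverOne(RecoveredContainer& container,
                   const std::map<std::string, DockerInfo>& containers,
                   const std::string& name) {
  auto it = containers.find(name);

  if (it == containers.end()) {
    // Case 2: the container was removed while the agent was down, so
    // it is absent from `_containers`. Resolve the status to None().
    container.pid = std::nullopt;
    container.status = std::nullopt;
    container.statusSet = true;
    return Outcome::Removed;
  }

  if (!it->second.pid.has_value()) {
    // Case 1: the container exists but was stopped while the agent
    // was down, so it has no pid. Resolve the status to None().
    container.pid = std::nullopt;
    container.status = std::nullopt;
    container.statusSet = true;
    return Outcome::Stopped;
  }

  // Otherwise the container is still running: remember its pid and
  // leave the status pending until the process actually exits.
  container.pid = it->second.pid;
  return Outcome::StillRunning;
}
```

In both failure cases the status ends up resolved without a pid, so nothing in this model would wait on a pid that may since have been reused by an unrelated process.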