[ https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Megha Sharma reassigned MESOS-8125:
-----------------------------------

    Assignee: Megha Sharma

> Agent should properly handle recovering an executor when its pid is reused
> --------------------------------------------------------------------------
>
>                 Key: MESOS-8125
>                 URL: https://issues.apache.org/jira/browse/MESOS-8125
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Gastón Kleiman
>            Assignee: Megha Sharma
>            Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process
> is assigned the same pid that the executor had before the reboot. In this
> case the agent will unsuccessfully try to reregister with the executor, and
> then transition it to a {{TERMINATING}} state. The executor will sadly get
> stuck in that state, and the tasks that it started will get stuck in whatever
> state they were in at the time of the reboot.
>
> One way of getting rid of stuck executors is to remove the {{latest}} symlink
> under
> {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
>
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta
> directory, e.g.,
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
> I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
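
For readers who want to script the pid-rewriting reproduction step, or to experiment with the "do not recover after a reboot" idea described in the issue, here is a minimal Python sketch. It is not Mesos code and not the fix chosen for this ticket. The function names are hypothetical, the pid file is assumed to hold the pid as plain decimal text, and comparing a saved boot id against the current one is only one possible way to detect a reboot.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_forked_pid(run_dir: Path) -> Optional[int]:
    """Return the pid stored in <run_dir>/pids/forked.pid.

    Returns None if the file is missing or does not contain an integer.
    Assumption: the file holds the pid as plain decimal text.
    """
    pid_file = run_dir / "pids" / "forked.pid"
    try:
        return int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def overwrite_forked_pid(run_dir: Path, pid: int) -> None:
    """Replace the checkpointed pid, as in the reproduction steps."""
    pid_file = run_dir / "pids" / "forked.pid"
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(str(pid))


def should_recover_executor(saved_boot_id: str, current_boot_id: str) -> bool:
    """Hypothetical policy: attempt recovery only if the host has not rebooted.

    After a reboot every executor is gone, so a checkpointed pid can only
    refer to an unrelated process that happens to reuse the number.
    """
    return saved_boot_id.strip() == current_boot_id.strip()


if __name__ == "__main__":
    # Demonstrate against a throwaway directory instead of a real work_dir.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "runs" / "latest"
        overwrite_forked_pid(run_dir, 2)
        print("checkpointed pid:", read_forked_pid(run_dir))
        print("recover?", should_recover_executor("boot-a", "boot-b"))
```

On a real agent, `run_dir` would be the `runs/latest` directory of the executor under the meta directory; stop the agent before modifying anything there.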