Gastón Kleiman created MESOS-8125:
-------------------------------------

             Summary: Agent shouldn't try to recover executors after a reboot
                 Key: MESOS-8125
                 URL: https://issues.apache.org/jira/browse/MESOS-8125
             Project: Mesos
          Issue Type: Bug
            Reporter: Gastón Kleiman


We know that all executors will be gone once the host on which an agent is 
running is rebooted, so there's no need to try to recover these executors.

Trying to recover stopped executors can lead to problems if another process is 
assigned the same pid that the executor had before the reboot. In this case the 
agent will unsuccessfully try to reregister with the executor, and then 
transition it to a {{TERMINATING}} state. The executor will sadly get stuck in 
that state, and the tasks that it started will get stuck in whatever state they 
were in at the time of the reboot.
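
The check proposed here could be sketched roughly as follows. This is only an illustration in Python, not the agent's actual recovery code: it assumes that a boot ID was saved before the host went down and that it can be compared with the current one. The function names are hypothetical, and {{/proc/sys/kernel/random/boot_id}} is Linux-specific.

```python
from pathlib import Path
from typing import Iterable, List

# Linux exposes a random ID that changes on every boot.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def read_current_boot_id(path: Path = BOOT_ID_PATH) -> str:
    """Return the boot ID of the running kernel (Linux only)."""
    return path.read_text().strip()


def host_rebooted(saved_boot_id: str, current_boot_id: str) -> bool:
    """True if the boot ID saved earlier differs from the current one."""
    return saved_boot_id.strip() != current_boot_id.strip()


def executors_to_recover(executor_ids: Iterable[str],
                         saved_boot_id: str,
                         current_boot_id: str) -> List[str]:
    """Hypothetical helper: skip every executor if the host has rebooted."""
    if host_rebooted(saved_boot_id, current_boot_id):
        # No process survives a reboot, so any checkpointed pid is stale
        # and may now belong to an unrelated process.
        return []
    return list(executor_ids)
```

With a check like this, the pid comparison is never reached after a reboot, so a reused pid cannot be mistaken for a live executor.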

One way of getting rid of stuck executors is to remove the {{latest}} symlink 
under 
{{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
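
As a rough sketch, the workaround above could be scripted like this. The helper name and its arguments are hypothetical; the path layout is the one given above, and the script only unlinks an actual symlink.

```python
import os


def remove_latest_run_symlink(work_dir: str,
                              framework_id: str,
                              executor_id: str) -> bool:
    """Remove the `latest` symlink in an executor's runs directory.

    Returns True if a symlink was removed, False if none was found.
    """
    latest = os.path.join(
        work_dir, "meta", "slaves", "latest",
        "frameworks", framework_id,
        "executors", executor_id,
        "runs", "latest",
    )
    # Only unlink a real symlink; never touch a regular directory.
    if os.path.islink(latest):
        os.unlink(latest)
        return True
    return False
```

The run directory that the symlink pointed to is left in place; only the link itself is removed.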

Here's how to reproduce this issue:

# Start a task using the Docker containerizer (the same will probably happen 
with the command executor).
# Stop the corresponding Mesos agent while the task is running.
# Change the executor's checkpointed forked pid, which is located in the meta 
directory, e.g., 
{{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}. 
I used pid 2, which is normally used by {{kthreadd}}.
# Reboot the host.
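
The pid change in the third step could be done with a small script along these lines. It is a sketch that assumes the {{forked.pid}} file holds the pid as plain decimal text; the function name is hypothetical.

```python
from pathlib import Path


def overwrite_forked_pid(forked_pid_path: str, new_pid: int = 2) -> int:
    """Replace the checkpointed pid and return the previous value.

    Assumes the file contains the pid as plain decimal text.
    """
    path = Path(forked_pid_path)
    old_pid = int(path.read_text().strip())
    # Pid 2 is the default because it normally belongs to kthreadd,
    # so it is guaranteed to be in use after the reboot.
    path.write_text(str(new_pid))
    return old_pid
```

Run it against the {{forked.pid}} path from the third step while the agent is stopped, then reboot.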



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
