Ian Downes created MESOS-2052:
---------------------------------
Summary: RunState::recover should always recover 'completed'
Key: MESOS-2052
URL: https://issues.apache.org/jira/browse/MESOS-2052
Project: Mesos
Issue Type: Bug
Components: containerization, slave
Affects Versions: 0.20.0
Reporter: Ian Downes
RunState::recover() returns partial state if it cannot find or open the
libprocess pid file. Specifically, it does not recover the 'completed' flag.
However, if the slave has removed the executor (because the launch failed or
the executor failed to register), the completed sentinel file will have been
written and this fact should be recovered. Recovering it ensures that container
recovery is not attempted later for that run.
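
A minimal sketch of the proposed ordering, not the actual Mesos code: the
struct, the helper readInt/readLine functions, and the file names
("forked.pid", "libprocess.pid", "completed.sentinel") are simplified
stand-ins chosen for illustration. The point it shows is that the sentinel is
checked before any early return caused by a missing pid file.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// Hypothetical, simplified stand-in for the slave's per-run state.
struct RunState {
  std::optional<int> forkedPid;
  std::optional<std::string> libprocessPid;
  bool completed = false;
};

// Reads an integer from a file; nullopt if the file is missing or malformed.
std::optional<int> readInt(const fs::path& path) {
  std::ifstream in(path);
  int value = 0;
  if (in >> value) return value;
  return std::nullopt;
}

// Reads the first line of a file; nullopt if the file cannot be opened.
std::optional<std::string> readLine(const fs::path& path) {
  std::ifstream in(path);
  if (!in) return std::nullopt;
  std::string line;
  std::getline(in, line);
  return line;
}

// Recovers run state from a run directory (illustrative layout only).
RunState recover(const fs::path& runDir) {
  assert(!runDir.empty());
  RunState state;

  // Recover 'completed' first so that no later early return can skip it.
  state.completed = fs::exists(runDir / "completed.sentinel");

  state.forkedPid = readInt(runDir / "forked.pid");
  if (!state.forkedPid) return state;  // Partial state, 'completed' is set.

  // A missing libprocess pid file still yields partial state, but the
  // 'completed' flag has already been recovered above.
  state.libprocessPid = readLine(runDir / "libprocess.pid");
  return state;
}
```

With this ordering, a run directory that holds a forked pid and the sentinel
but no libprocess pid file is reported as completed rather than as a live
run with partial state.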
This was discovered when the LinuxLauncher failed to recover because it was
asked to recover two containers with the same forkedPid. Investigation showed
that both executors had OOM'ed before registering, i.e., no libprocess pid file
was present. However, the containerizer had detected the OOM, destroyed the
container, and notified the slave, which cleaned everything up: failing the
task and calling removeExecutor (which writes the completed sentinel file).
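
To illustrate why the recovered flag matters downstream, here is a hedged
sketch of filtering runs before they are handed to a launcher. The
RecoveredRun type and both functions are hypothetical and do not reflect the
real containerizer or LinuxLauncher interfaces; the pid value is made up.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of one recovered executor run.
struct RecoveredRun {
  std::string containerId;
  std::optional<int> forkedPid;
  bool completed = false;
};

// Returns only the runs whose containers should still be recovered:
// completed runs and runs without a forked pid are skipped.
std::vector<RecoveredRun> recoverable(const std::vector<RecoveredRun>& runs) {
  std::vector<RecoveredRun> result;
  for (const auto& run : runs) {
    if (run.completed) continue;   // Already cleaned up by the slave.
    if (!run.forkedPid) continue;  // Nothing for a launcher to track.
    result.push_back(run);
  }
  return result;
}

// True if two runs share a forked pid, the condition described above.
bool hasDuplicatePid(const std::vector<RecoveredRun>& runs) {
  std::set<int> seen;
  for (const auto& run : runs) {
    if (run.forkedPid && !seen.insert(*run.forkedPid).second) return true;
  }
  return false;
}
```

If 'completed' is not recovered, a cleaned-up run looks like a live one and
can reach the launcher alongside another run with the same pid. Once the flag
is recovered, the filter drops the completed run and the collision does not
occur.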