Ian Downes created MESOS-2052:
---------------------------------

             Summary: RunState::recover should always recover 'completed'
                 Key: MESOS-2052
                 URL: https://issues.apache.org/jira/browse/MESOS-2052
             Project: Mesos
          Issue Type: Bug
          Components: containerization, slave
    Affects Versions: 0.20.0
            Reporter: Ian Downes


RunState::recover() will return partial state if it cannot find or open the 
libprocess pid file. Specifically, it does not recover the 'completed' flag.

However, if the slave has removed the executor (because the launch failed or 
the executor failed to register), the 'completed' sentinel file will have been 
written, and this fact should be recovered regardless of the pid file. This 
ensures that container recovery is not attempted later.
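
A minimal sketch of the intended ordering, not the actual Mesos code: the 
struct, the recover() helper, and the file names ("completed", 
"pids/forked.pid", "pids/libprocess.pid") are all assumptions made for 
illustration. The point is only that the sentinel is checked before any early 
return caused by a missing libprocess pid file.

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical, simplified stand-in for the slave's per-run state.
struct RunState {
  std::optional<int> forkedPid;              // set if the forked pid file is readable
  std::optional<std::string> libprocessPid;  // set if the libprocess pid file is readable
  bool completed = false;                    // true if the sentinel file exists
};

// Hypothetical recovery routine for one run directory.
RunState recover(const fs::path& runDir) {
  RunState state;

  // Recover 'completed' first, so that no later early return can skip it.
  std::error_code ec;
  state.completed = fs::exists(runDir / "completed", ec);

  // The forked pid is optional: leave it unset if the file is missing or bad.
  std::ifstream forked(runDir / "pids" / "forked.pid");
  int pid = 0;
  if (forked >> pid) {
    state.forkedPid = pid;
  }

  // A missing libprocess pid file still yields partial state, but the
  // 'completed' flag above has already been recovered by this point.
  std::ifstream libprocess(runDir / "pids" / "libprocess.pid");
  std::string upid;
  if (!std::getline(libprocess, upid)) {
    return state;
  }

  state.libprocessPid = upid;
  return state;
}
```

With this ordering, a run whose executor never registered but was already 
removed by the slave comes back with completed set, even though the rest of 
its state is partial.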

This was discovered when the LinuxLauncher failed to recover because it was 
asked to recover two containers with the same forkedPid. Investigation showed 
the executors both OOM'ed before registering, i.e., no libprocess pid file was 
present. However, the containerizer had detected the OOM, destroyed the 
container, and notified the slave, which cleaned everything up: failing the 
task and calling removeExecutor (which writes the completed sentinel file).
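
To illustrate why the missing flag matters downstream, here is a rough sketch 
of how recovered runs might be filtered before being handed to a launcher. 
RecoveredRun, LauncherInput and selectForLauncher are hypothetical names, not 
Mesos APIs, and the sketch does not claim to reproduce how LinuxLauncher 
detects the conflict. It only shows that runs marked completed drop out before 
any duplicate forkedPid can be seen.

```cpp
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of one recovered executor run.
struct RecoveredRun {
  std::string containerId;
  std::optional<int> forkedPid;
  bool completed = false;
};

// Hypothetical result: runs to recover, plus ids that collided on a pid.
struct LauncherInput {
  std::vector<RecoveredRun> toRecover;
  std::vector<std::string> duplicates;
};

LauncherInput selectForLauncher(const std::vector<RecoveredRun>& runs) {
  LauncherInput out;
  std::set<int> seen;

  for (const RecoveredRun& run : runs) {
    // Completed runs were already cleaned up by the slave; skip them.
    // Runs without a forked pid have nothing for the launcher to recover.
    if (run.completed || !run.forkedPid) {
      continue;
    }

    // A second live run with the same pid is the conflict described above.
    if (!seen.insert(*run.forkedPid).second) {
      out.duplicates.push_back(run.containerId);
      continue;
    }

    out.toRecover.push_back(run);
  }

  return out;
}
```

If 'completed' is not recovered for the two OOM'ed executors, both runs reach 
the duplicate check with the same forkedPid. Once the flag is recovered, 
neither does.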



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
