Charles created MESOS-10007:

             Summary: random "Failed to get exit status for Command" for 
short-lived commands
                 Key: MESOS-10007
             Project: Mesos
          Issue Type: Bug
          Components: executor
            Reporter: Charles


While testing Mesos to see if we could use it at work, I encountered a random 
bug which I believe happens when a command exits really quickly, when run via 
the command executor.

See the attached test case, but basically all it does is constantly start "exit 
0" tasks.

At some point, a task randomly fails with the error "Failed to get exit status 
for Command":

'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', 
'source': 'SOURCE_EXECUTOR',

I've had a look at the code, and I found something which could potentially 
explain it. It's the first time I've looked at the code, so apologies if I'm 
missing something.

We can see the error originates from `reaped`:

    } else if (status_->isNone()) {
      taskState = TASK_FAILED;
      message = "Failed to get exit status for Command";
    } else {

Looking at the code, we can see that the `status_` future can be set to `None` 
in `ReaperProcess::reap`:



Future<Option<int>> ReaperProcess::reap(pid_t pid)
{
  // Check to see if this pid exists.
  if (os::exists(pid)) {
    Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
    promises.put(pid, promise);
    return promise->future();
  } else {
    return None();
  }
}


So we could have this if the process has already been reaped (`kill -0` will 
fail).
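
To illustrate the point about reaped processes, here is a minimal POSIX sketch, not taken from the Mesos sources. It assumes the existence check behaves like `kill(pid, 0)`; `pidExists`, `Observation` and `observeShortLivedChild` are hypothetical names introduced only for this example. It forks a child that exits immediately and checks the pid before and after `waitpid` collects it.

```cpp
#include <cerrno>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for an existence check in the style of `kill -0`:
// signal 0 delivers nothing, it only asks whether the pid can be signalled.
// EPERM still means the process is there, we just may not signal it.
bool pidExists(pid_t pid)
{
  return ::kill(pid, 0) == 0 || errno == EPERM;
}

// What the parent observed about a short-lived child.
struct Observation
{
  bool beforeReap; // existence check before waitpid()
  bool afterReap;  // existence check after waitpid()
};

// Forks a child that exits straight away (similar to an "exit 0" task)
// and runs the existence check on both sides of the reap.
Observation observeShortLivedChild()
{
  pid_t pid = ::fork();
  if (pid < 0) {
    return {false, false}; // fork failed, nothing to observe
  }

  if (pid == 0) {
    ::_exit(0); // child: terminate immediately
  }

  // Even if the child has already exited, it stays in the process table
  // as a zombie until its parent waits for it, so the check succeeds.
  bool before = pidExists(pid);

  // Reap the child; its pid is released from the process table.
  int status = 0;
  ::waitpid(pid, &status, 0);

  // The pid is gone now, so the same check fails.
  bool after = pidExists(pid);

  return {before, after};
}
```

Under these assumptions, a caller that only checks for the pid after somebody else has already reaped it cannot tell "never existed" apart from "exited and was collected", and has no way to recover the exit status.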


Now, looking at the code path which spawns the process:




calls `subprocess`:



If we look at the bottom of the function we can see the following:



  // We need to bind a copy of this Subprocess into the onAny callback
  // below to ensure that we don't close the file descriptors before
  // the subprocess has terminated (i.e., because the caller doesn't
  // keep a copy of this Subprocess around themselves).
    .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));

  return process;


So at this point we've already called `process::reap`.


And after that, the executor also calls `process::reap`:



    // Monitor this process.
      .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));


But if we look at the implementation of `process::reap`:



Future<Option<int>> reap(pid_t pid)
  // The reaper process is instantiated in `process::initialize`.
  process::initialize();

  return dispatch(

We can see that `ReaperProcess::reap` is going to get called asynchronously.


Doesn't this mean that it's possible that the first call to `reap`, set up by 
`subprocess`, will get executed first, and if the task has already exited by 
that time, the child will get reaped before the call to `reap` set up by the 
executor gets a chance to run?


In that case, when it runs, `os::exists(pid)` would return false, and `reap` 
would set the future to `None`, which would result in this error.
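
The suspected ordering can be sketched with a toy, single-threaded model. This is not libprocess code: `ToyReaper`, `spawn`, `runNext`, `collect` and `simulate` are hypothetical names, and the queue only imitates the fact that a dispatched `reap` runs its existence check later than the call that requested it.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Toy model of a reaper whose `reap` requests are handled asynchronously.
class ToyReaper
{
public:
  using Callback = std::function<void(std::optional<int>)>;

  // Marks a pid as present in the (modelled) process table.
  void spawn(int pid) { live.insert(pid); }

  // Asynchronous `reap`: nothing is checked here, the request is only
  // queued. The existence check happens when the request is run.
  void reap(int pid, Callback callback)
  {
    mailbox.push_back([this, pid, callback]() {
      if (live.count(pid) > 0) {
        waiters.emplace(pid, callback); // pid exists: wait for its status
      } else {
        callback(std::nullopt); // pid already gone: the "None" case
      }
    });
  }

  // Runs the oldest queued request, if there is one.
  void runNext()
  {
    if (mailbox.empty()) {
      return;
    }
    std::function<void()> request = mailbox.front();
    mailbox.pop_front();
    request();
  }

  // Models the reaper collecting an exited child: the pid disappears and
  // every waiter registered so far receives the exit status.
  void collect(int pid, int status)
  {
    live.erase(pid);
    auto range = waiters.equal_range(pid);
    for (auto it = range.first; it != range.second; ++it) {
      it->second(status);
    }
    waiters.erase(range.first, range.second);
  }

private:
  std::set<int> live;
  std::multimap<int, Callback> waiters;
  std::deque<std::function<void()>> mailbox;
};

// Queues two `reap` requests for the same pid and optionally lets the
// child be collected between the two requests being run.
std::vector<std::string> simulate(bool collectBetweenRequests)
{
  std::vector<std::string> log;
  auto record = [&log](const std::string& who) {
    return [&log, who](std::optional<int> status) {
      log.push_back(
          who + (status ? ": exit status " + std::to_string(*status)
                        : std::string(": None")));
    };
  };

  ToyReaper reaper;
  const int pid = 42; // arbitrary pid for the model
  reaper.spawn(pid);

  reaper.reap(pid, record("subprocess")); // first request
  reaper.reap(pid, record("executor"));   // second request

  reaper.runNext(); // first request runs: pid exists, waiter registered
  if (collectBetweenRequests) {
    reaper.collect(pid, 0); // child exits and is reaped right here
  }
  reaper.runNext(); // second request runs its existence check
  if (!collectBetweenRequests) {
    reaper.collect(pid, 0);
  }

  return log;
}
```

In this model, `simulate(true)` leaves the second requester with `None` while the first one gets the exit status, which matches the failure described above. `simulate(false)` delivers the status to both. Whether the real reaper can interleave this way is exactly the open question.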


This message was sent by Atlassian Jira
