Charles created MESOS-10007:
-------------------------------
Summary: random "Failed to get exit status for Command" for
short-lived commands
Key: MESOS-10007
URL: https://issues.apache.org/jira/browse/MESOS-10007
Project: Mesos
Issue Type: Bug
Components: executor
Reporter: Charles
Attachments: test_scheduler.py
Hi,
While testing Mesos to see if we could use it at work, I encountered a random
bug which I believe happens when a command exits really quickly, when run via
the command executor.
See the attached test case, but basically all it does is constantly start "exit
0" tasks.
At some point, a task randomly fails with the error "Failed to get exit status
for Command":
{noformat}
'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command',
'source': 'SOURCE_EXECUTOR',{noformat}
I've had a look at the code, and I found something which could potentially
explain it - it's the first time I look at the code so apologies if I'm missing
something.
We can see the error originates from `reaped`:
[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017]
{noformat}
} else if (status_->isNone()) {
taskState = TASK_FAILED;
message = "Failed to get exit status for Command";
} else {{noformat}
Looking at the code, we can see that the `status_` future can be set to `None`
in `ReaperProcess::reap`:
[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69]
{noformat}
Future<Option<int>> ReaperProcess::reap(pid_t pid)
{
// Check to see if this pid exists.
if (os::exists(pid)) {
Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
promises.put(pid, promise);
return promise->future();
} else {
return None();
}
}{noformat}
So we could have this if the process has already been reaped (`kill -0` will
fail).
Now, looking at the code path which spawns the process:
`launchTaskSubprocess`
[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724]
calls `subprocess`:
[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315]
If we look at the bottom of the function we can see the following:
[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]
{noformat}
// We need to bind a copy of this Subprocess into the onAny callback
// below to ensure that we don't close the file descriptors before
// the subprocess has terminated (i.e., because the caller doesn't
// keep a copy of this Subprocess around themselves).
process::reap(process.data->pid)
.onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));
return process;{noformat}
So at this point we've already called `process::reap`.
And after that, the executor also calls `process::reap`:
[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801]
{noformat}
// Monitor this process.
process::reap(pid.get())
.onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));{noformat}
But if we look at the implementation of `process::reap`:
[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152]
{noformat}
Future<Option<int>> reap(pid_t pid)
{
// The reaper process is instantiated in `process::initialize`.
process::initialize(); return dispatch(
internal::reaper,
&internal::ReaperProcess::reap,
pid);
}{noformat}
We can see that `ReaperProcess::reap` is going to get called asynchronously.
Doesn't this mean that it's possible that the first call to `reap` set up by
`subprocess`
([https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462)|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]
will get executed first, and if the task has already exited by that time, the
child will get reaped before the call to `reap` set up by the executor
([https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801])
gets a chance to run?
In that case, when it runs
{noformat}
if (os::exists(pid)) {{noformat}
would return false, `reap` would set the future to None which would result in
this error.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)