[jira] [Commented] (MESOS-10007) random "Failed to get exit status for Command" for short-lived commands

Benjamin Mahler (Jira) Mon, 21 Oct 2019 16:35:02 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956524#comment-16956524
 ]


Benjamin Mahler commented on MESOS-10007:
-----------------------------------------

Hi [~Charle], thanks for the nice ticket! I attached a patch that avoids the 
double reaping, please try that out and let me know if that works for your test 
case.

[~gilbert] [~qianzhang] note that the launcher also has a double reap, but it 
does not depend on the status, so the only issue in that case is if the pid is 
reused before SubprocessLauncher::destroy calls destroy. You may want to audit 
the containerization code for double reap issues.

> random "Failed to get exit status for Command" for short-lived commands
> -----------------------------------------------------------------------
>
>                 Key: MESOS-10007
>                 URL: https://issues.apache.org/jira/browse/MESOS-10007
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor, libprocess
>            Reporter: Charles
>            Priority: Major
>              Labels: foundations
>         Attachments: 
> 0001-Avoid-double-reaping-race-in-the-command-executor.patch, 
> executor_race_reprod.diff, test_scheduler.py
>
>
> Hi,
> While testing Mesos to see if we could use it at work, I encountered a random 
> bug which I believe happens when a command exits really quickly, when run via 
> the command executor.
> See the attached test case, but basically all it does is constantly start 
> "exit 0" tasks.
> At some point, a task randomly fails with the error "Failed to get exit 
> status for Command":
>  
> {noformat}
> 'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', 
> 'source': 'SOURCE_EXECUTOR',{noformat}
>   
> I've had a look at the code, and I found something which could potentially 
> explain it - it's the first time I look at the code so apologies if I'm 
> missing something.
>  We can see the error originates from `reaped`:
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017]
> {noformat}
>     } else if (status_->isNone()) {
>       taskState = TASK_FAILED;
>       message = "Failed to get exit status for Command";
>     } else {{noformat}
>  
> Looking at the code, we can see that the `status_` future can be set to 
> `None` in `ReaperProcess::reap`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69]
>  
>  
> {noformat}
> Future<Option<int>> ReaperProcess::reap(pid_t pid)
> {
>   // Check to see if this pid exists.
>   if (os::exists(pid)) {
>     Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
>     promises.put(pid, promise);
>     return promise->future();
>   } else {
>     return None();
>   }
> }{noformat}
>  
>  
> So we could have this if the process has already been reaped (`kill -0` will 
> fail).
>  
> Now, looking at the code path which spawns the process:
> `launchTaskSubprocess`
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724]
>  
> calls `subprocess`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315]
>  
> If we look at the bottom of the function we can see the following:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]
>  
>  
> {noformat}
>   // We need to bind a copy of this Subprocess into the onAny callback
>   // below to ensure that we don't close the file descriptors before
>   // the subprocess has terminated (i.e., because the caller doesn't
>   // keep a copy of this Subprocess around themselves).
>   process::reap(process.data->pid)
>     .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));  
> return process;{noformat}
>  
>  
> So at this point we've already called `process::reap`.
>  
> And after that, the executor also calls `process::reap`:
> [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801]
>  
>  
> {noformat}
>     // Monitor this process.
>     process::reap(pid.get())
>       .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));{noformat}
>  
>  
> But if we look at the implementation of `process::reap`:
> [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152]
>  
>  
> {noformat}
> Future<Option<int>> reap(pid_t pid)
> {
>   // The reaper process is instantiated in `process::initialize`.
>   process::initialize();  return dispatch(
>       internal::reaper,
>       &internal::ReaperProcess::reap,
>       pid);
> }{noformat}
> We can see that `ReaperProcess::reap` is going to get called asynchronously.
>  
> Doesn't this mean that it's possible that the first call to `reap` set up by 
> `subprocess` 
> ([https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462)|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]
> will get executed first, and if the task has already exited by that time, the 
> child will get reaped before the call to `reap` set up by the executor 
> ([https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801])
>  gets a chance to run?
>  
> In that case, when it runs
>  
> {noformat}
> if (os::exists(pid)) {{noformat}
> would return false, `reap` would set the future to None which would result in 
> this error.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (MESOS-10007) random "Failed to get exit status for Command" for short-lived commands

Reply via email to