Till Toenshoff created MESOS-1431:
-------------------------------------
Summary: io::splice usage needs special care - especially in
connection with process::subprocess
Key: MESOS-1431
URL: https://issues.apache.org/jira/browse/MESOS-1431
Project: Mesos
Issue Type: Bug
Affects Versions: 0.19.0
Reporter: Till Toenshoff
When using {{process::subproces}} in connection with {{io::splice}}, make sure
you work with extra care.
Consider this piece of code (ripped from EC), which looks rather unspectacular
- very straightforward use of subprocess and splice.
{noformat}
// Fork exec of external process.
Try<Subprocess> external = process::subprocess(
execute,
environment);
if (external.isError()) {
return Error("Failed to execute external containerizer: " +
external.error());
}
// Set stderr into non-blocking mode.
Try<Nothing> nonblock = os::nonblock(external.get().err());
if (nonblock.isError()) {
return Error("Failed to accept nonblock: " + nonblock.error());
}
// Redirect output (stderr) from the subprocess to a log
// file.
Try<int> err = os::open(
sandbox.isSome() ? path::join(sandbox.get().directory, "stderr")
: "/dev/null",
O_WRONLY | O_CREAT | O_APPEND | O_NONBLOCK,
S_IRUSR | S_IWUSR | S_IRGRP | S_IRWXO);
if (err.isError()) {
return Error(
"Failed to redirect stderr: Failed to open: " +
err.error());
}
Try<Nothing> cloexec = os::cloexec(err.get());
if (cloexec.isError()) {
os::close(err.get());
return Error(
"Failed to redirect stderr: Failed to set close-on-exec: " +
cloexec.error());
}
io::splice(external.get().err(), err.get())
.onAny(&os::close, err.get()));
{noformat}
The above code is buggy as subprocess will close {{external.get().err()}} once
the child got reaped. So the FD we are reading (splicing) from potentially gets
closed before the splicer was able to get the {{EOF}}. That in turn will cause
libev to continue polling that FD (remember, it never reached a final state)
and that is where havoc breaks lose as we will now see data getting lost that
is sent from reused FDs.
If you do not {{dup}} the subprocess returned FDs before using an
{{io::splice}} on it, thus not taking full ownership of the FD, you will end up
with fancy problems.
So a fix would be replacing the last two lines by this:
{noformat}
// Duplicate 'from' so that we're in control of it's lifetime,
// exceeding the lifetime of the reaped subprocess. It is needed
// to make sure the splicer had a chance to read and process the
// EOF.
int fd = dup(external.get().err());
if (fd == -1) {
os::close(err.get());
return ErrnoError("Failed to duplicate stderr file descriptor");
}
io::splice(fd, err.get())
.onAny(bind(&_invoke, fd, err.get()));
{noformat}
And also adding an onAny triggered function (_invoke in this case) that is
closing both, the dup'ed FD and the log-file FD.
The bad news are that the EC is not the only implementation that contains such
buggy code. There is at least one more such bug found within
mesos_containerizer.cpp:680 and we should now all check for more such use-cases
and fix them before releasing.
--
This message was sent by Atlassian JIRA
(v6.2#6252)