Steven Schlansker created MESOS-5195:
----------------------------------------
Summary: Docker executor: task logs lost on shutdown
Key: MESOS-5195
URL: https://issues.apache.org/jira/browse/MESOS-5195
Project: Mesos
Issue Type: Bug
Components: containerization, docker
Affects Versions: 0.27.2
Environment: Linux 4.4.2 "Ubuntu 14.04.2 LTS"
Reporter: Steven Schlansker
When you kill a task running under the Docker executor (in our case via
Singularity), the task shuts down cleanly, but the last lines it writes to
standard out / standard error are lost during teardown.
For example, we run dumb-init. With debugging on, you can see it should write:
{noformat}
DEBUG("Forwarded signal %d to children.\n", signum);
{noformat}
If you attach strace to the process, you can see that it clearly writes the
text to stderr. But that message is lost and is never written to the sandbox
'stderr' file.
We believe the issue starts here, in the Docker executor's executor.cpp:
{code}
void shutdown(ExecutorDriver* driver)
{
  cout << "Shutting down" << endl;
  if (run.isSome() && !killed) {
    // The docker daemon might still be in progress starting the
    // container, therefore we kill both the docker run process
    // and also ask the daemon to stop the container.
    // Making a mutable copy of the future so we can call discard.
    Future<Nothing>(run.get()).discard();
    stop = docker->stop(containerName, stopTimeout);
    killed = true;
  }
}
{code}
Notice how the "run" future is discarded *before* the Docker daemon is told to
stop -- now what will discarding it do?
{code}
void commandDiscarded(const Subprocess& s, const string& cmd)
{
  VLOG(1) << "'" << cmd << "' is being discarded";
  os::killtree(s.pid(), SIGKILL);
}
{code}
Oops, just sent SIGKILL to the entire process tree...
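To illustrate why SIGKILL swallows the last message, here is a minimal standalone sketch. It is not Mesos code: 'lastWords', 'onTerm' and 'kLastWords' are hypothetical names, and the child here is both the process that gets killed and the one that writes, which simplifies the real setup where the whole tree under 'docker run' is killed. It only assumes a POSIX system.

```cpp
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the task's final log line.
static const char kLastWords[] = "Forwarded signal to children.\n";

// SIGTERM handler: write the final line to stderr, then exit.
// Only async-signal-safe calls (write, _exit) are used here.
extern "C" void onTerm(int)
{
  ssize_t written = write(STDERR_FILENO, kLastWords, sizeof(kLastWords) - 1);
  _exit(written < 0 ? 1 : 0);
}

// Forks a child whose stderr is a pipe, delivers `sig` to it, and returns
// whatever the child managed to write before it died.
std::string lastWords(int sig)
{
  int logs[2];
  int ready[2];
  if (pipe(logs) != 0 || pipe(ready) != 0) {
    return "<pipe failed>";
  }

  pid_t pid = fork();
  if (pid < 0) {
    return "<fork failed>";
  }

  if (pid == 0) {
    // Child: route stderr into the pipe and install the handler.
    close(logs[0]);
    close(ready[0]);
    dup2(logs[1], STDERR_FILENO);
    close(logs[1]);
    signal(SIGTERM, onTerm);

    // Tell the parent the handler is in place, then idle.
    char token = 1;
    if (write(ready[1], &token, 1) != 1) {
      _exit(1);
    }
    close(ready[1]);
    for (;;) {
      pause();
    }
  }

  // Parent: wait until the child is ready, then signal it.
  close(logs[1]);
  close(ready[1]);
  char token = 0;
  ssize_t got = read(ready[0], &token, 1);
  (void) got;
  close(ready[0]);

  kill(pid, sig);
  waitpid(pid, nullptr, 0);

  // Drain everything the child wrote before it died.
  std::string out;
  char buf[128];
  ssize_t n;
  while ((n = read(logs[0], buf, sizeof(buf))) > 0) {
    out.append(buf, static_cast<size_t>(n));
  }
  close(logs[0]);
  return out;
}
```

Calling 'lastWords(SIGTERM)' returns the final line, while 'lastWords(SIGKILL)' returns an empty string, because SIGKILL cannot be caught and the handler never runs.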
Another (harmless?) side effect shows up in the Docker daemon logs, because the
daemon never gets a chance to kill the task:
{noformat}
ERROR Handler for DELETE /v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233 returned error: No such container: mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
{noformat}
I suspect that the fix is to wait for 'docker->stop()' to complete before
discarding the 'run' future.
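As a rough process-level analogy for that ordering (not a patch, and not libprocess code), the sketch below asks politely first, waits for a bounded grace period, and only then falls back to SIGKILL. This mirrors what 'docker stop' itself does with its timeout. 'stopThenKill' and 'spawnIdleChild' are hypothetical names, and the 10 ms polling interval is an arbitrary choice.

```cpp
#include <chrono>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Sends SIGTERM, waits up to `timeout` for the process to terminate, and
// only then falls back to SIGKILL. Returns true if no SIGKILL was needed.
bool stopThenKill(pid_t pid, std::chrono::milliseconds timeout)
{
  if (pid <= 0) {
    return false;  // Never signal a process group or "everything".
  }

  kill(pid, SIGTERM);

  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (waitpid(pid, nullptr, WNOHANG) == pid) {
      return true;  // Terminated within the grace period.
    }
    usleep(10 * 1000);  // Poll every 10 ms.
  }

  kill(pid, SIGKILL);  // Last resort, once the graceful stop had its chance.
  waitpid(pid, nullptr, 0);
  return false;
}

// Demo helper: forks a child that idles until it is signalled.
// With `ignoreTerm` set, the child inherits SIG_IGN for SIGTERM, so it
// behaves like a task that does not react to a graceful stop.
pid_t spawnIdleChild(bool ignoreTerm)
{
  // Set the disposition before fork() so the child has it from the start,
  // then restore the parent's previous handler.
  auto previous = signal(SIGTERM, ignoreTerm ? SIG_IGN : SIG_DFL);
  pid_t pid = fork();
  if (pid == 0) {
    for (;;) {
      pause();
    }
  }
  signal(SIGTERM, previous);
  return pid;
}
```

With a cooperative child, 'stopThenKill(spawnIdleChild(false), std::chrono::milliseconds(2000))' returns true without ever sending SIGKILL. With a child that ignores SIGTERM it returns false after the timeout. In the executor, the equivalent would presumably be to chain the discard onto completion of the 'stop' future instead of issuing it first.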
Happy to provide more information if necessary.