[ https://issues.apache.org/jira/browse/MESOS-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler reassigned MESOS-5195:
--------------------------------------
Assignee: Benjamin Mahler
This looks to be a duplicate of MESOS-4279. I'll take this on, since we don't
currently have a responsive maintainer for the Docker support.
> Docker executor: task logs lost on shutdown
> -------------------------------------------
>
> Key: MESOS-5195
> URL: https://issues.apache.org/jira/browse/MESOS-5195
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 0.27.2
> Environment: Linux 4.4.2 "Ubuntu 14.04.2 LTS"
> Reporter: Steven Schlansker
> Assignee: Benjamin Mahler
> Fix For: 1.0.0
>
>
> When you try to kill a task running in the Docker executor (in our case via
> Singularity), the task shuts down cleanly but the last logs to standard out /
> standard error are lost in teardown.
> For example, we run dumb-init. With debugging on, you can see it should
> write:
> {noformat}
> DEBUG("Forwarded signal %d to children.\n", signum);
> {noformat}
> If you attach strace to the process, you can see it clearly writes the text
> to stderr. But that message is lost and is never written to the sandbox
> 'stderr' file.
> We believe the issue starts here, in Docker executor.cpp:
> {code}
> void shutdown(ExecutorDriver* driver)
> {
>   cout << "Shutting down" << endl;
>   if (run.isSome() && !killed) {
>     // The docker daemon might still be in progress starting the
>     // container, therefore we kill both the docker run process
>     // and also ask the daemon to stop the container.
>     // Making a mutable copy of the future so we can call discard.
>     Future<Nothing>(run.get()).discard();
>     stop = docker->stop(containerName, stopTimeout);
>     killed = true;
>   }
> }
> {code}
> Notice how the "run" future is discarded *before* the Docker daemon is told
> to stop -- now what will discarding it do?
> {code}
> void commandDiscarded(const Subprocess& s, const string& cmd)
> {
>   VLOG(1) << "'" << cmd << "' is being discarded";
>   os::killtree(s.pid(), SIGKILL);
> }
> {code}
> Oops, just sent SIGKILL to the entire process tree...
> You can see another (harmless?) side effect in the Docker daemon logs. The
> daemon never gets a chance to kill the task itself:
> {noformat}
> ERROR Handler for DELETE /v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233 returned error: No such container: mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
> {noformat}
> I suspect that the fix is to wait for 'docker->stop()' to complete before
> discarding the 'run' future.
> Happy to provide more information if necessary.
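
The ordering suggested in the report (let 'docker->stop()' complete, and only then discard the 'run' future) could be sketched roughly as below. This is a toy model in plain C++ and not the Mesos or libprocess API: FakeDocker, FakeRun, shutdownStopFirst and the event strings are hypothetical stand-ins invented for illustration, and the real fix may look quite different.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the Docker client; NOT the Mesos API.
class FakeDocker {
public:
  explicit FakeDocker(std::vector<std::string>& events) : events_(events) {}

  // Models docker->stop(): the container gets to exit (and flush its
  // output) before the completion callback is invoked.
  void stop(const std::function<void()>& onStopped) {
    events_.push_back("stop requested");
    events_.push_back("container exited, output flushed");
    onStopped();
  }

private:
  std::vector<std::string>& events_;
};

// Hypothetical stand-in for the 'run' future of the docker run process.
class FakeRun {
public:
  explicit FakeRun(std::vector<std::string>& events) : events_(events) {}

  // Models discard(), which per the report ends in killtree(SIGKILL).
  void discard() { events_.push_back("run discarded (SIGKILL)"); }

private:
  std::vector<std::string>& events_;
};

// Shutdown ordering proposed in the report: stop first, and discard
// 'run' only from the completion callback of stop.
std::vector<std::string> shutdownStopFirst() {
  std::vector<std::string> events;
  FakeDocker docker(events);
  FakeRun run(events);
  docker.stop([&run]() { run.discard(); });
  return events;
}
```

Calling shutdownStopFirst() returns the recorded events in order, with the SIGKILL step last. In the code quoted above the discard happens first, so the process tree is killed before the daemon has stopped the container.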