-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44571/#review127685
-----------------------------------------------------------

Fix it, then Ship it!


src/slave/containerizer/docker.cpp (line 1854)
<https://reviews.apache.org/r/44571/#comment191073>

    The 1 second grace period here is pretty arbitrary. Can you make it a
    named constant?


src/slave/containerizer/docker.cpp (line 1954)
<https://reviews.apache.org/r/44571/#comment191074>

    This is useless since docker->stop does not handle `discard()` properly.
    We still rely on the subprocess to terminate anyway.

    To be clear, calling `discard()` on a future does not necessarily mean
    that the future will end up in the DISCARDED state. It's up to the owner
    of the promise to decide (by calling `promise->discard()`).


- Jie Yu


On April 4, 2016, 2:05 p.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44571/
> -----------------------------------------------------------
> 
> (Updated April 4, 2016, 2:05 p.m.)
> 
> 
> Review request for mesos, Jie Yu and Joris Van Remoortere.
> 
> 
> Bugs: MESOS-4673
>     https://issues.apache.org/jira/browse/MESOS-4673
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Commands issued to the Docker daemon can hang, causing problems within Mesos.
> For example, a hanging 'docker stop' can result in an unresponsive executor,
> causing the Mesos agent to run a 'docker stop' itself, which might
> result in an unresponsive agent (see MESOS-4673).
> Adding a timeout can be used as a workaround.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/docker.hpp 89d450e10a84f24ddd46d517e2b4b46ab02c4fda 
>   src/slave/containerizer/docker.cpp 9314d1f9e0b6077fe7c48b860783ab21acc48be6 
> 
> Diff: https://reviews.apache.org/r/44571/diff/
> 
> 
> Testing
> -------
> 
> sudo ./bin/mesos-tests.sh (to test if existing tests break due to the changed
> behavior)
> 
> Because docker must hang for both the Mesos agent as well as the
> `mesos-docker-executor`, it can't currently be tested as part of the Mesos
> integration tests.
> Here's how to test that the timeout works:
> 
> Run with Fedora 23 (Kernel 4.2.3, Docker 1.9.1)
> 
> # Start a master
> ./bin/mesos-master.sh --work_dir=/tmp/mesos &
> 
> # Start an agent
> sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --containerizers=docker &
> 
> # Run a task using the docker containerizer
> ./src/mesos-execute --containerizer=docker --docker_image=alpine \
>   --master=127.0.0.1:5050 --name="sleep" --command="sleep 1000" &
> # Note the pid of `mesos-execute` as well as the pid of the sleep task run by
> # docker (e.g. 3323 and 3474)
> 
> # Have mesos run `docker inspect` to gather the pid of the docker task
> curl -X GET localhost:5051/monitor/statistics
> 
> # Now overload docker by trying to run a lot of tasks in parallel
> for i in `seq 1 100`; do sudo docker run --rm alpine sleep 60 & done
> 
> # Wait until the first of these docker tasks finishes; `sudo docker ps` should
> # be unresponsive now
> # Kill the `mesos-execute` task (e.g. 3323)
> kill 3323
> 
> # Watch the logs of the Mesos agent. At some point it will send a SIGKILL to
> # the docker task (e.g. 3474)
> # Make sure that the docker task is indeed terminated (using `ps fax` or the
> # like)
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>
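
For reference, the `discard()` semantics from the comment on line 1954 can be sketched in plain C++. This is only a rough illustration: the `Future`, `Promise`, `Shared` and `discardTakesEffect` names below are hypothetical, simplified stand-ins and not the libprocess classes or their API. The sketch assumes a single thread and models nothing but the handshake, where `discard()` on the future is a request and only the promise owner can move the future to DISCARDED.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical, simplified stand-ins for a future/promise pair.
// They only model the discard handshake; they are NOT the libprocess API.
enum class State { PENDING, READY, DISCARDED };

struct Shared {
  State state = State::PENDING;
  bool discardRequested = false;
  std::function<void()> onDiscard;  // Installed by the promise owner.
};

class Future {
public:
  explicit Future(std::shared_ptr<Shared> s) : shared(std::move(s)) {}

  // A discard is only a request: it records the request and notifies the
  // owner, but never changes the state by itself.
  void discard() {
    if (shared->state != State::PENDING) return;
    shared->discardRequested = true;
    if (shared->onDiscard) shared->onDiscard();
  }

  bool hasDiscard() const { return shared->discardRequested; }
  bool isPending() const { return shared->state == State::PENDING; }
  bool isDiscarded() const { return shared->state == State::DISCARDED; }

private:
  std::shared_ptr<Shared> shared;
};

class Promise {
public:
  Promise() : shared(std::make_shared<Shared>()) {}

  Future future() const { return Future(shared); }

  // The owner decides how (and whether) to react to a discard request.
  void onDiscard(std::function<void()> callback) {
    shared->onDiscard = std::move(callback);
  }

  // Only the owner can move the future into the DISCARDED state.
  bool discard() {
    if (shared->state != State::PENDING) return false;
    shared->state = State::DISCARDED;
    return true;
  }

private:
  std::shared_ptr<Shared> shared;
};

// Returns true if calling discard() on the future left it DISCARDED.
bool discardTakesEffect(bool ownerHonorsDiscard) {
  Promise promise;
  if (ownerHonorsDiscard) {
    // The owner opts in by reacting to the discard request.
    promise.onDiscard([&promise]() { promise.discard(); });
  }

  Future future = promise.future();
  future.discard();

  return future.isDiscarded();
}
```

With an owner that ignores the request, `discardTakesEffect(false)` returns false and the future stays pending, which is the situation the comment describes for `docker->stop`: the caller still has to rely on the subprocess terminating. Only when the owner reacts to the request, as in `discardTakesEffect(true)`, does the future end up DISCARDED.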
