> On Aug. 10, 2017, 12:07 p.m., Alexander Rukletsov wrote:
> > src/docker/executor.cpp
> > Lines 410-415 (original), 416-421 (patched)
> > <https://reviews.apache.org/r/61530/diff/1/?file=1794166#file1794166line416>
> >
> >     Let's add a comment explaining why we are doing retry / unblock on
> >     failure and why not timeout. Something like:
> >     ```
> >     // Invoking `docker stop` might be unsuccessful, in which case the
> >     // container most probably does not receive the signal. In this case
> >     // we should allow schedulers to retry the kill operation or, if the
> >     // kill was initiated by a failing health check, retry ourselves. We
> >     // do not bail out nor stop retrying to avoid sending a terminal
> >     // status update while the container might still be running.
> >     //
> >     // NOTE: `docker stop` might also hang. We do not address this for
> >     // now, because there is no evidence that in this case docker daemon
> >     // might function properly, i.e., it is only the docker cli command
> >     // that hangs, and hence there is not so much we can do.
> >     ```
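
For context, here is a minimal sketch of the retry / unblock behaviour that the proposed comment describes. It is only an illustration under stated assumptions: `KillTracker`, `KillOutcome` and the `dockerStop` callback are hypothetical names, not the actual code in `src/docker/executor.cpp`, and the real executor is asynchronous while this models the decision synchronously.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical outcomes of a single kill attempt (illustration only).
enum class KillOutcome {
  Stopped,              // `docker stop` succeeded.
  Ignored,              // The container was already stopped by an earlier kill.
  AwaitSchedulerRetry,  // `docker stop` failed; the scheduler may retry the kill.
  RetryOurselves        // `docker stop` failed after a health check failure.
};

// Hypothetical stand-in for the executor's kill bookkeeping.
class KillTracker {
public:
  // `dockerStop` models invoking `docker stop`; it returns true on success.
  explicit KillTracker(std::function<bool()> dockerStop)
    : dockerStop_(std::move(dockerStop)) {
    assert(dockerStop_ != nullptr);
  }

  KillOutcome kill(bool fromHealthCheck) {
    if (killed_) {
      return KillOutcome::Ignored;
    }

    killed_ = true;

    if (dockerStop_()) {
      return KillOutcome::Stopped;
    }

    // Unblock on failure: the container most probably did not receive the
    // signal, so later kill attempts must not be ignored. There is
    // deliberately no timeout and no terminal status update here.
    killed_ = false;

    return fromHealthCheck ? KillOutcome::RetryOurselves
                           : KillOutcome::AwaitSchedulerRetry;
  }

private:
  std::function<bool()> dockerStop_;
  bool killed_ = false;
};
```

The caller is expected to call `kill()` again on `RetryOurselves`, and to leave the task running on `AwaitSchedulerRetry` until the scheduler sends another kill.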

We should also refer to https://issues.apache.org/jira/browse/MESOS-6743 in the comment so that folks can get more context.


- Alexander


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61530/#review182573
-----------------------------------------------------------


On Aug. 9, 2017, 4:55 p.m., Andrei Budnik wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61530/
> -----------------------------------------------------------
>
> (Updated Aug. 9, 2017, 4:55 p.m.)
>
>
> Review request for mesos and Alexander Rukletsov.
>
>
> Bugs: MESOS-6743
>     https://issues.apache.org/jira/browse/MESOS-6743
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Previously, after `docker stop` command failure, docker executor
> neither allowed a scheduler to retry `killTask` command, nor retried to
> kill a task when task killing was triggered by a failed health check.
>
>
> Diffs
> -----
>
>   src/docker/executor.cpp 26f12ec002f754fab0d34c01472cf95b499d8007
>
>
> Diff: https://reviews.apache.org/r/61530/diff/1/
>
>
> Testing
> -------
>
> sudo make check
>
> Manual testing:
>
> Emulating `docker stop` errors:
> ===============================
> 1. Add `return fmt.Errorf("Emulating error!")` to https://github.com/docker/docker-ce/blob/master/components/engine/daemon/stop.go#L21
> 2. build docker from sources: http://oyvindsk.com/writing/docker-build-from-source
> 3. stop docker service and launch dockerd binary, like: `sudo ./bundles/17.06.0-dev/binary-daemon/dockerd`
>
> Emulating docker daemon hang:
> =============================
> 1. `ps aux|grep dockerd` - 2 processes will be found
> 2. `sudo kill -STOP <PID1> <PID2>` - send SIGSTOP to docker daemon processes just before sending `docker stop`
>
> Emulating health check failure in docker executor:
> ==================================================
> 1. Add
> ```c++
> static int fake = 0;
> if (++fake > 10) {
>   failure();
>   return;
> }
> ```
> to `HealthChecker::processCheckResult()` in `src/checks/health_checker.cpp`
> 2. Add
> ```c++
> HealthCheck healthCheck;
> healthCheck.set_type(HealthCheck::COMMAND);
> healthCheck.mutable_command()->set_value("exit 0");
> healthCheck.set_delay_seconds(0);
> healthCheck.set_interval_seconds(0);
> healthCheck.set_grace_period_seconds(1);
> _task.mutable_health_check()->CopyFrom(healthCheck);
> ```
> to `CommandScheduler::offers()` in `src/cli/execute.cpp`
> 3. compile mesos
> 4. run mesos agent: `sudo GLOG_v=1 ./bin/mesos-agent.sh --resources="cpus:10000;mem:1000000" --work_dir='/home/some_user/mesos/build/var/agent-1' --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 5. launch docker executor: `./src/mesos-execute --master="127.0.1.1:5050" --name="a" --containerizer=docker --docker_image="ubuntu:xenial" --command="while true; do : ; done"`
>
>
> Thanks,
>
> Andrei Budnik
>
>