----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65713/#review197841 -----------------------------------------------------------
src/docker/executor.cpp Lines 436-437 (original), 436-437 (patched) <https://reviews.apache.org/r/65713/#comment278129> This comment is not strictly corrected: `run.isSome() == true` only means that we have launched a docker cli `run` command and nothing more. Can you please update the comment? And maybe also leave a note how `run` and `inspect` relate to task states where these fields are declared? Thanks! As a side note (no call to action), a proper state machine with exlicit states would be great to have here. src/docker/executor.cpp Lines 451-455 (patched) <https://reviews.apache.org/r/65713/#comment278132> So instead of checking that inspect **always** finishes after `DOCKER_INSPECT_TIMEOUT`, we only check that if we want to kill the container, either via kill or reap. I can see the reason for this (we may want to give `docker inspect` enough time to crank up its whale machinery if the user gives us this time), but maybe still print a warning? For example, hang the following at the inspect call site: ``` inspect .after(DOCKER_INSPECT_TIMEOUT, [=](const Future<Nothing>& f) { LOG(WARNING) << "docker inspect has not finished after " << DOCKER_INSPECT_TIMEOUT; return f; }); src/docker/executor.cpp Line 509 (original), 518-519 (patched) <https://reviews.apache.org/r/65713/#comment278130> "Docker stop timed out after " << KILL_RETRY_INTERVAL; Also no need to specify the container name if you follow the suggestion below. src/docker/executor.cpp Lines 523-527 (patched) <https://reviews.apache.org/r/65713/#comment278131> Instead of this, why not make `stop.onFailed` below `stop.recover`? - Alexander Rukletsov On Feb. 20, 2018, 5:11 p.m., Andrei Budnik wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/65713/ > ----------------------------------------------------------- > > (Updated Feb. 20, 2018, 5:11 p.m.) > > > Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and > Vinod Kone. > > > Bugs: MESOS-8574 > https://issues.apache.org/jira/browse/MESOS-8574 > > > Repository: mesos > > > Description > ------- > > Previosly, if `docker inspect` command hanged, the docker container > ended up in an unkillable state. This patch adds a timeout for inspect > command after receiving `killTask` analogically to `reaped` handler. > In addition we've added a timeout for `docker stop` command. If docker > `stop` or `inspect` command times out, we discard the related future, > thus the docker library kills previously spawned docker cli subprocess. > As a result, a scheduler can retry `killTask` operation to handle > nasty docker bugs that lead to hanging docker cli. > > > Diffs > ----- > > src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc > > > Diff: https://reviews.apache.org/r/65713/diff/1/ > > > Testing > ------- > > internal CI > > Manual testing: > 1. Build docker from sources: > http://oyvindsk.com/writing/docker-build-from-source > 2. Modify `ContainerInspect` function from `docker/inspect.go`: > ``` > func (daemon *Daemon) ContainerInspect(name string, size bool, version > string) (interface{}, error) { > + time.Sleep(10 * time.Second) > ``` > 3. Modify `ContainerStop` function from `docker/stop.go`: > ``` > func (daemon *Daemon) ContainerStop(name string, seconds *int) error { > + rand.Seed(time.Now().UTC().UnixNano()) > + if rand.Intn(2) == 0 { > + time.Sleep(20 * time.Second) > + } > ``` > 4. Rebuild docker: `sudo make build && sudo make binary` > 5. Stop system docker daemon: `sudo service docker stop` > 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev` > 7. Modify `src/cli/execute.cpp`: > a) Add `delay(Seconds(15), self(), &Self::retryKill, task->task_id(), > offer.agent_id());` after > https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606 > b) Add a new method `retryKill` to `CommandScheduler`: > ``` > void retryKill(const TaskID& taskId, const AgentID& agentId) > { > killTask(taskId, agentId); > delay(Seconds(6), self(), &Self::retryKill, taskId, agentId); > } > ``` > 8. Rebuild mesos > 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'` > 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh > --resources="cpus:10000;mem:1000000" > --work_dir='/home/abudnik/mesos/build/var/agent-1' > --containerizers="docker,mesos" --master="127.0.1.1:5050"` > 11. Submit a task for the docker executor: `./src/mesos-execute > --master="127.0.1.1:5050" --name="a" --containerizer=docker > --docker_image="ubuntu:xenial" --command="sleep 9999"` > > > Thanks, > > Andrei Budnik > >
