[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089364#comment-15089364 ]
Martin Bydzovsky commented on MESOS-4279:
-----------------------------------------

Well, I guess you introduced another "issue" in your test example. It's related to the way you started the Marathon app - please see the explanation here: https://mesosphere.github.io/marathon/docs/native-docker.html#command-vs-args. In your {{ps}} output you can see that the actual command is {{/bin/sh -c python /app/script.py}} - it is wrapped by {{sh -c}}. It seems you started your Marathon app with something like:

{code}
curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", cmd: "python script.py", ...}
{code}

What I was showing in my examples above was:

{code}
curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", args: ["/tmp/script.py"], ...}
{code}

This is usually called the "PID 1 problem" - https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86#.zcxhq8yqn. Simply put, in your example PID 1 inside the Docker container is the shell process, and the actual Python script is PID 2. The default signal disposition for every process EXCEPT PID 1 is to terminate on SIGINT/SIGTERM; PID 1 ignores those signals unless it installs its own handlers.

So retry the example with {{args}} instead of {{cmd}}. Your {{ps}} output should then look like:

{code}
root 10738 0.0 0.0 218228 14236 ? 15:22 0:00 docker run -c 102 -m 268435456 -e PORT_10002=31123 -e MARATHON_APP_VERSION=2016-01-08T15:22:49.646Z -e HOST=mesos-slave1.example.com -e MARATHON_APP_DOCKER_IMAGE=bydga/marathon-test-api -e MESOS_TASK_ID=marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b -e PORT=31123 -e PORTS=31123 -e MARATHON_APP_ID=/marathon-test-api -e PORT0=31123 -e MESOS_SANDBOX=/mnt/mesos/sandbox -v /srv/mesos/slaves/20160106-114735-3423223818-5050-1508-S3/frameworks/20160106-083626-1258962954-5050-9311-0000/executors/marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b/runs/bbeb80ab-e8d0-4b93-b7a0-6475787e090f:/mnt/mesos/sandbox --net host --name mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f bydga/marathon-test-api ./script.py
root 10749 0.0 0.0 21576 4336 ? 15:22 0:00 /usr/bin/python ./script.py
{code}

With this setup, {{docker stop}} works as expected:

{code}
bydzovskym mesos-slave1:aws ~ 🍺 docker ps
CONTAINER ID   IMAGE                     COMMAND         CREATED         STATUS         PORTS   NAMES
ed4a35e4372c   bydga/marathon-test-api   "./script.py"   7 minutes ago   Up 7 minutes           mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f
bydzovskym mesos-slave1:aws ~ 🍺 time docker stop ed4a35e4372c
ed4a35e4372c

real    0m2.184s
user    0m0.016s
sys     0m0.042s
{code}

and the output of the Docker container:

{code}
bydzovskym mesos-slave1:aws ~ 🍺 docker logs -f ed4a35e4372c
Hello
15:15:57.943294
Iteration #1
15:15:58.944470
Iteration #2
15:15:59.945631
Iteration #3
15:16:00.946794
got 15
15:16:40.473517
15:16:42.475655
ending
Goodbye
{code}

The {{docker stop}} took a little more than 2 seconds - matching the grace period in the Python script. I still suspect the problem is somewhere in Mesos orchestrating Docker: either it sends a wrong {{docker kill}}, or it kills the task even more forcefully (e.g. killing the {{docker run}} process with the Linux {{kill}} command)...
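For anyone hitting this: a quick way to tell which of the two situations you are in is to look at what actually runs as PID 1 inside the container. This is only a sketch - it reuses the container ID from the transcript above, but any running container ID works:

{code}
# Show what runs as PID 1 inside the container. If it prints something
# like "/bin/sh -c python /app/script.py", the shell - not the script -
# receives the SIGTERM from docker stop, and it ignores the signal
# because it is PID 1 with no handler installed:
docker exec ed4a35e4372c cat /proc/1/cmdline | tr '\0' ' '; echo

# The same check via ps, if procps is installed in the image:
docker exec ed4a35e4372c ps -p 1 -o pid,args
{code}

If PID 1 is the shell wrapper, the Python script never sees the SIGTERM and gets SIGKILLed once the {{docker stop}} grace period (10 seconds by default) expires.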
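And if someone has to stay with {{cmd}} (e.g. for environment variable interpolation), a common workaround - just a sketch, I have not verified it against this exact setup - is to {{exec}} the script so it replaces the shell as PID 1:

{code}
# "exec" makes the python process replace /bin/sh as PID 1, so the
# SIGTERM from docker stop reaches the script's handler directly:
curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", cmd: "exec ./script.py", ...}
{code}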
> Graceful restart of docker task
> -------------------------------
>
>                 Key: MESOS-4279
>                 URL: https://issues.apache.org/jira/browse/MESOS-4279
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.25.0
>            Reporter: Martin Bydzovsky
>            Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I came across the following issue:
> (it was already discussed on https://github.com/mesosphere/marathon/issues/2876 and the guys from Mesosphere got to the point that it's probably a Docker containerizer problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>     args: ["/tmp/script.py"],
>     instances: 1,
>     cpus: 0.1,
>     mem: 256,
>     id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application through Marathon:
> {code:javascript}
> data = {
>     args: ["./script.py"],
>     container: {
>         type: "DOCKER",
>         docker: {
>             image: "bydga/marathon-test-api"
>         },
>         forcePullImage: yes
>     },
>     cpus: 0.1,
>     mem: 256,
>     instances: 1,
>     id: "marathon-test-api"
> }
> {code}
> the task during restart (issued from Marathon) dies immediately without having a chance to do any cleanup.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)