[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089301#comment-15089301 ]
Qian Zhang commented on MESOS-4279:
-----------------------------------

When creating an app of Docker type in Marathon, the processes launched on the Mesos agent look like this:
{code}
root 2086 2063 0 Jan06 ? 00:00:49 docker -H unix:///var/run/docker.sock run -c 102 -m 33554432 -e MARATHON_APP_VERSION=2016-01-06T14:24:40.412Z -e HOST=mesos -e MARATHON_APP_DOCKER_IMAGE=mesos-4279 -e PORT_10000=31433 -e MESOS_TASK_ID=app-docker1.af64d5d2-b481-11e5-bdf1-0242497320ff -e PORT=31433 -e PORTS=31433 -e MARATHON_APP_ID=/app-docker1 -e PORT0=31433 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-9ee670be-3c38-4c23-91c1-826b283dd283-S7.a919ce36-9b6e-4086-bfe8-9f0a34a3f471 -v /tmp/mesos/slaves/9ee670be-3c38-4c23-91c1-826b283dd283-S7/frameworks/83ced7f5-69b3-409b-abe5-a582a5d278cd-0000/executors/app-docker1.af64d5d2-b481-11e5-bdf1-0242497320ff/runs/a919ce36-9b6e-4086-bfe8-9f0a34a3f471:/mnt/mesos/sandbox --net bridge --entrypoint /bin/sh --name mesos-9ee670be-3c38-4c23-91c1-826b283dd283-S7.a919ce36-9b6e-4086-bfe8-9f0a34a3f471 mesos-4279 -c python /app/script.py
root 2124 2103 0 Jan06 ? 00:00:00 /bin/sh -c python /app/script.py
root 2140 2124 0 Jan06 ? 00:00:35 python /app/script.py
{code}
The first process (2086) is the "docker run" command launched by the Mesos docker executor, and the second and third processes (2124 and 2140) are the app processes launched by the Docker daemon.

When restarting the app in Marathon, the Mesos docker executor kills the app processes first. It does so by running the "docker stop" command (https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L218), and "docker stop" sends SIGTERM ONLY to process 2124, NOT to 2140 (the actual user script). That is why the signal handler in the user script is not triggered.

However, for an app that is not of Docker type, the executor sends SIGTERM to the process group when killing it (https://github.com/apache/mesos/blob/0.26.0/src/launcher/executor.cpp#L419), so the user script gets the signal too. I am not sure whether there is a way for "docker stop" to send SIGTERM not only to the parent process of the user script process but also to the user script process itself ...

> Graceful restart of docker task
> -------------------------------
>
>                 Key: MESOS-4279
>                 URL: https://issues.apache.org/jira/browse/MESOS-4279
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.25.0
>            Reporter: Martin Bydzovsky
>            Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I
> came across the following issue:
> (it was already discussed on
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result: the task receives SIGTERM and
> dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application through Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without
> having a chance to do any cleanup.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)