[jira] [Commented] (MESOS-4279) Graceful restart of docker task

Qian Zhang (JIRA) Fri, 08 Jan 2016 06:58:02 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089301#comment-15089301
 ]


Qian Zhang commented on MESOS-4279:
-----------------------------------

When an app of Docker type is created in Marathon, the processes launched on 
the Mesos agent look like this:
{code}
root      2086  2063  0 Jan06 ?        00:00:49 docker -H 
unix:///var/run/docker.sock run -c 102 -m 33554432 -e 
MARATHON_APP_VERSION=2016-01-06T14:24:40.412Z -e HOST=mesos -e 
MARATHON_APP_DOCKER_IMAGE=mesos-4279 -e PORT_10000=31433 -e 
MESOS_TASK_ID=app-docker1.af64d5d2-b481-11e5-bdf1-0242497320ff -e PORT=31433 -e 
PORTS=31433 -e MARATHON_APP_ID=/app-docker1 -e PORT0=31433 -e 
MESOS_SANDBOX=/mnt/mesos/sandbox -e 
MESOS_CONTAINER_NAME=mesos-9ee670be-3c38-4c23-91c1-826b283dd283-S7.a919ce36-9b6e-4086-bfe8-9f0a34a3f471
 -v 
/tmp/mesos/slaves/9ee670be-3c38-4c23-91c1-826b283dd283-S7/frameworks/83ced7f5-69b3-409b-abe5-a582a5d278cd-0000/executors/app-docker1.af64d5d2-b481-11e5-bdf1-0242497320ff/runs/a919ce36-9b6e-4086-bfe8-9f0a34a3f471:/mnt/mesos/sandbox
 --net bridge --entrypoint /bin/sh --name 
mesos-9ee670be-3c38-4c23-91c1-826b283dd283-S7.a919ce36-9b6e-4086-bfe8-9f0a34a3f471
 mesos-4279 -c python /app/script.py
root      2124  2103  0 Jan06 ?        00:00:00 /bin/sh -c python /app/script.py
root      2140  2124  0 Jan06 ?        00:00:35 python /app/script.py
{code}

The first process (2086) is the "docker run" command launched by the Mesos 
docker executor, and the second and third processes (2124 and 2140) are the app 
processes launched by the Docker daemon. When the app is restarted in Marathon, 
the Mesos docker executor first kills the app processes by running the 
"docker stop" command 
(https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L218). 
"docker stop" sends SIGTERM ONLY to process 2124 (the /bin/sh wrapper), NOT to 
2140 (the actual user script), which is why the signal handler in the user 
script is never triggered.

However, for an app that is not of Docker type, the executor kills it by 
sending SIGTERM to the whole process group 
(https://github.com/apache/mesos/blob/0.26.0/src/launcher/executor.cpp#L419), 
so the user script receives the signal too.

I am not sure whether there is a way to make "docker stop" send SIGTERM not 
only to the parent of the user script process but also to the user script 
process itself ...
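
For illustration only, here is a sketch of the forwarding behavior that the 
parent process (2124) does not provide: a parent that relays SIGTERM and 
SIGINT to its child and then waits for it. run_forwarding is a hypothetical 
helper name, not part of Mesos, Marathon or Docker, and the sketch does not 
address whether such a process can be placed where /bin/sh sits today, since 
that depends on how the executor builds the "docker run" command.

```python
import os
import signal
import subprocess
import sys


def run_forwarding(argv, signals=(signal.SIGTERM, signal.SIGINT)):
    """Run argv as a child, relay the given signals to it, return its exit code.

    Must be called from the main thread, because Python only allows signal
    handlers to be installed there.
    """
    child = None

    def forward(signo, _frame):
        # Relay the signal to the child once it has been started.
        if child is not None:
            try:
                os.kill(child.pid, signo)
            except ProcessLookupError:
                pass  # child already exited and was reaped

    # Install the handlers before starting the child so that a signal
    # arriving early does not terminate this process.
    previous = {s: signal.signal(s, forward) for s in signals}
    try:
        child = subprocess.Popen(argv)
        # wait() is retried automatically after a handler has run.
        return child.wait()
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)  # restore the original handlers
```

Under these assumptions, a call such as 
run_forwarding(["python", "/app/script.py"]) would give the user script the 
chance to run its own SIGTERM handler when the parent is signalled.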

> Graceful restart of docker task
> -------------------------------
>
>                 Key: MESOS-4279
>                 URL: https://issues.apache.org/jira/browse/MESOS-4279
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.25.0
>            Reporter: Martin Bydzovsky
>            Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> ran into the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from 
> Mesosphere concluded that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>       args: ["/tmp/script.py"],
>       instances: 1,
>       cpus: 0.1,
>       mem: 256,
>       id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result: the task receives SIGTERM 
> and dies peacefully (within the 2-second period specified by my script).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application through Marathon:
> {code:javascript}
> data = {
>       args: ["./script.py"],
>       container: {
>               type: "DOCKER",
>               docker: {
>                       image: "bydga/marathon-test-api"
>               },
>               forcePullImage: true
>       },
>       cpus: 0.1,
>       mem: 256,
>       instances: 1,
>       id: "marathon-test-api"
> }
> {code}
> The task dies immediately during the restart (issued from Marathon), without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
