[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089364#comment-15089364 ]
Martin Bydzovsky commented on MESOS-4279:
-----------------------------------------

Well, I guess you introduced another "issue" in your test example. It's related to the way you started the Marathon app - please see the explanation here: https://mesosphere.github.io/marathon/docs/native-docker.html#command-vs-args. In your {{ps}} output you can see that the actual command is {{/bin/sh -c python /app/script.py}} - it is wrapped by {{sh -c}}. It seems you started your Marathon app with something like:

{code}
curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", cmd: "python script.py", ...}
{code}

What I was showing in my examples above was:

{code}
curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", args: ["/tmp/script.py"], ...}
{code}

This is usually called the "PID 1 problem" - https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86#.zcxhq8yqn. Simply put, in your example PID 1 inside the Docker container is the shell process, and the actual Python script is PID 2. The default signal disposition for every process EXCEPT PID 1 is to terminate on SIGINT/SIGTERM; PID 1 ignores those signals unless it installs its own handlers.

So retry the example with {{args}} instead of {{cmd}}. Your {{ps}} output should then look like:

{code}
root 10738 0.0 0.0 218228 14236 ? 15:22 0:00 docker run -c 102 -m 268435456 -e PORT_10002=31123 -e MARATHON_APP_VERSION=2016-01-08T15:22:49.646Z -e HOST=mesos-slave1.example.com -e MARATHON_APP_DOCKER_IMAGE=bydga/marathon-test-api -e MESOS_TASK_ID=marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b -e PORT=31123 -e PORTS=31123 -e MARATHON_APP_ID=/marathon-test-api -e PORT0=31123 -e MESOS_SANDBOX=/mnt/mesos/sandbox -v /srv/mesos/slaves/20160106-114735-3423223818-5050-1508-S3/frameworks/20160106-083626-1258962954-5050-9311-0000/executors/marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b/runs/bbeb80ab-e8d0-4b93-b7a0-6475787e090f:/mnt/mesos/sandbox --net host --name mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f bydga/marathon-test-api ./script.py
root 10749 0.0 0.0 21576 4336 ? 15:22 0:00 /usr/bin/python ./script.py
{code}

With this setup, {{docker stop}} works as expected:

{code}
bydzovskym mesos-slave1:aws ~ 🍺 docker ps
CONTAINER ID   IMAGE                     COMMAND         CREATED         STATUS         PORTS   NAMES
ed4a35e4372c   bydga/marathon-test-api   "./script.py"   7 minutes ago   Up 7 minutes           mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f
bydzovskym mesos-slave1:aws ~ 🍺 time docker stop ed4a35e4372c
ed4a35e4372c

real    0m2.184s
user    0m0.016s
sys     0m0.042s
{code}

and the output of the Docker container:

{code}
bydzovskym mesos-slave1:aws ~ 🍺 docker logs -f ed4a35e4372c
Hello
15:15:57.943294
Iteration #1
15:15:58.944470
Iteration #2
15:15:59.945631
Iteration #3
15:16:00.946794
got 15
15:16:40.473517
15:16:42.475655
ending
Goodbye
{code}

The {{docker stop}} took a little more than 2 seconds - matching the grace period in the Python script. I still suspect the problem is somewhere in Mesos orchestrating Docker: either it sends a wrong {{docker kill}}, or it kills the task even more forcefully (e.g. killing the {{docker run}} process with the Linux {{kill}} command)...
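For anyone hitting this: a quick way to tell which of the two situations you are in is to look at what actually runs as PID 1 inside the container. This is only a sketch - it reuses the container ID from the transcript above, but any running container ID works:

{code}
# Show what runs as PID 1 inside the container. If it prints something
# like "/bin/sh -c python /app/script.py", the shell - not the script -
# receives the SIGTERM from docker stop, and it ignores the signal
# because it is PID 1 with no handler installed:
docker exec ed4a35e4372c cat /proc/1/cmdline | tr '\0' ' '; echo

# The same check via ps, if procps is installed in the image:
docker exec ed4a35e4372c ps -p 1 -o pid,args
{code}

If PID 1 is the shell wrapper, the Python script never sees the SIGTERM and gets SIGKILLed once the {{docker stop}} grace period (10 seconds by default) expires.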
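And if someone has to stay with {{cmd}} (e.g. for environment variable interpolation), a common workaround - just a sketch, I have not verified it against this exact setup - is to {{exec}} the script so it replaces the shell as PID 1:

{code}
# "exec" makes the python process replace /bin/sh as PID 1, so the
# SIGTERM from docker stop reaches the script's handler directly:
curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", cmd: "exec ./script.py", ...}
{code}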
> Graceful restart of docker task
> -------------------------------
>
>                 Key: MESOS-4279
>                 URL: https://issues.apache.org/jira/browse/MESOS-4279
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.25.0
>            Reporter: Martin Bydzovsky
>            Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I came across the following issue:
> (it was already discussed on https://github.com/mesosphere/marathon/issues/2876 and the guys from Mesosphere got to the point that it's probably a Docker containerizer problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>     args: ["/tmp/script.py"],
>     instances: 1,
>     cpus: 0.1,
>     mem: 256,
>     id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application through Marathon:
> {code:javascript}
> data = {
>     args: ["./script.py"],
>     container: {
>         type: "DOCKER",
>         docker: {
>             image: "bydga/marathon-test-api"
>         },
>         forcePullImage: yes
>     },
>     cpus: 0.1,
>     mem: 256,
>     instances: 1,
>     id: "marathon-test-api"
> }
> {code}
> the task during restart (issued from Marathon) dies immediately without having a chance to do any cleanup.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)