[
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089364#comment-15089364
]
Martin Bydzovsky commented on MESOS-4279:
-----------------------------------------
Well, I guess you introduced another "issue" in your test example. It's related
to the way you started the Marathon app. Please look at the explanation here:
https://mesosphere.github.io/marathon/docs/native-docker.html#command-vs-args.
In your {{ps}} output, you can see that the actual command is {{/bin/sh -c
python /app/script.py}} - i.e. it is wrapped by {{sh -c}}.
It seems like you started your Marathon app with something like:
{code}
curl -XPOST http://marathon:8080/v2/apps --data '{"id": "test-app", "cmd": "python script.py", ...}'
{code}
What I was showing in my examples above was:
{code}
curl -XPOST http://marathon:8080/v2/apps --data '{"id": "test-app", "args": ["/tmp/script.py"], ...}'
{code}
Usually this is called the "PID 1 problem" -
https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86#.zcxhq8yqn.
Simply put, in your example PID 1 inside the docker container is the shell
process and the actual python script is PID 2. The default signal disposition
for every process EXCEPT PID 1 is to terminate on SIGINT/SIGTERM; PID 1 just
ignores those signals unless it installs its own handlers.
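You can verify this yourself by looking at the process tree inside the running container (the container id below is a placeholder, and this assumes {{ps}} is available in the image):
{code}
# Take the container id from `docker ps`.
docker exec <container-id> ps -ef

# With "cmd" (shell form), PID 1 is the wrapping shell and python is its child:
#   PID 1  /bin/sh -c python /app/script.py
#   PID 7  python /app/script.py
#
# With "args" (exec form), the python script itself runs as PID 1:
#   PID 1  /usr/bin/python ./script.py
{code}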
So you could retry the example and use {{args}} instead of {{cmd}}. Then your {{ps}}
output should look like:
{code}
root 10738 0.0 0.0 218228 14236 ? 15:22 0:00 docker run -c 102 -m
268435456 -e PORT_10002=31123 -e MARATHON_APP_VERSION=2016-01-08T15:22:49.646Z
-e HOST=mesos-slave1.example.com -e
MARATHON_APP_DOCKER_IMAGE=bydga/marathon-test-api -e
MESOS_TASK_ID=marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b -e
PORT=31123 -e PORTS=31123 -e MARATHON_APP_ID=/marathon-test-api -e PORT0=31123
-e MESOS_SANDBOX=/mnt/mesos/sandbox -v
/srv/mesos/slaves/20160106-114735-3423223818-5050-1508-S3/frameworks/20160106-083626-1258962954-5050-9311-0000/executors/marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b/runs/bbeb80ab-e8d0-4b93-b7a0-6475787e090f:/mnt/mesos/sandbox
--net host --name
mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f
bydga/marathon-test-api ./script.py
root 10749 0.0 0.0 21576 4336 ? 15:22 0:00 /usr/bin/python
./script.py
{code}
With this setup, {{docker stop}} works as expected:
{code}
bydzovskym mesos-slave1:aws ~ 🍺 docker ps
CONTAINER ID IMAGE
COMMAND CREATED STATUS PORTS
NAMES
ed4a35e4372c bydga/marathon-test-api
"./script.py" 7 minutes ago Up 7 minutes
mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f
bydzovskym mesos-slave1:aws ~ 🍺 time docker stop ed4a35e4372c
ed4a35e4372c
real 0m2.184s
user 0m0.016s
sys 0m0.042s
{code}
and the output of the docker container:
{code}
bydzovskym mesos-slave1:aws ~ 🍺 docker logs -f ed4a35e4372c
Hello
15:15:57.943294
Iteration #1
15:15:58.944470
Iteration #2
15:15:59.945631
Iteration #3
15:16:00.946794
got 15
15:16:40.473517
15:16:42.475655
ending
Goodbye
{code}
The docker stop took a little more than 2 seconds - matching the grace period in the
python script.
I still suspect the problem is somewhere in how Mesos orchestrates the docker
container - either it sends a wrong {{docker kill}}, or it kills the task even more
painfully (e.g. by killing the {{docker run}} process with the Linux {{kill}} command...).
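For reference (just a sketch of the plain docker CLI semantics, not Mesos' actual code path): {{docker stop}} sends SIGTERM first and only falls back to SIGKILL after a timeout, while {{docker kill}} sends SIGKILL right away, so which command the containerizer issues - and with what timeout - makes all the difference:
{code}
# docker stop: SIGTERM, then SIGKILL after the timeout (10 seconds by default).
docker stop -t 10 <container-id>

# docker kill: SIGKILL immediately, no chance for any cleanup in the container.
docker kill <container-id>

# docker kill can also deliver a specific signal, e.g. SIGTERM only:
docker kill --signal=TERM <container-id>
{code}
If Mesos ends up on the {{docker kill}}/plain {{kill}} path, or uses a zero stop timeout (IIRC the agent's {{--docker_stop_timeout}} flag controls how long it waits between SIGTERM and SIGKILL), the SIGTERM handler in the script never gets a chance to run.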
> Graceful restart of docker task
> -------------------------------
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 0.25.0
> Reporter: Martin Bydzovsky
> Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I
> came across the following issue:
> (it was already discussed on
> https://github.com/mesosphere/marathon/issues/2876 and the guys from Mesosphere
> got to a point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM and
> dies peacefully (within the 2-second grace period specified in my script).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application through Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without
> having a chance to do any cleanup.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)