[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186839#comment-15186839
 ] 

Martin Bydzovsky commented on MESOS-4279:
-----------------------------------------

Hi again [~qianzhang]. So I finally managed to solve the issue. There are 
actually two bugs in docker/executor.cpp:

1) The first is related to my findings about the error {{attach: stdout: write 
unix @: broken pipe}}, which I mentioned 2 months ago in comment 
https://issues.apache.org/jira/browse/MESOS-4279?focusedCommentId=15091797&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15091797:
You are calling the {{run->discard}} method (which closes the stderr/stdout 
streams) too early. During the stopping period the container can (and usually 
will) write something about the termination, but the stream is already 
invalid. That's why the docker daemon complains, and that's why we haven't 
seen the messages from my script (got 15... ending... Goodbye...). Here's a 
commit that fixes it:
https://github.com/bydga/mesos/commit/73d1b3dc8605cf51163619c9e05c88666d926951


2) The second is about the actual wrong TASK_KILLED state:
Basically, you are always setting the {{killed=true}} flag at 
https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L227, 
which is not correct. Plus, the check at 
https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L300 
is also wrong: a task counts as failed when "it's not ready"? So here is a 
commit that fixes it too:
https://github.com/bydga/mesos/commit/acf79781c04ad9309083dc39131e2c8305331431

I'm not a C++ developer, so the code might not be the best; however, it 
finally works.
So, what now: do you want me to create one pull request that solves both, or 
two pull requests, one per issue I mentioned? They touch the same lines of 
code, so two separate pull requests might lead to conflicts... Can you 
process this? I don't want to create a new issue and go through the same 
months-long procedure again.

Btw, I have created the commits against the 0.26.0 tag, because we were 
talking about this version. Should I rebase them onto master? Is there a 
chance they will get merged into the current WIP 0.28?

Thanks for quick answers.

> Graceful restart of docker task
> -------------------------------
>
>                 Key: MESOS-4279
>                 URL: https://issues.apache.org/jira/browse/MESOS-4279
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.25.0
>            Reporter: Martin Bydzovsky
>            Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from 
> mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>       args: ["/tmp/script.py"],
>       instances: 1,
>       cpus: 0.1,
>       mem: 256,
>       id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result: the task receives SIGTERM 
> and dies peacefully (within the 2-second period specified in my script).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application through Marathon:
> {code:javascript}
> data = {
>       args: ["./script.py"],
>       container: {
>               type: "DOCKER",
>               docker: {
>                       image: "bydga/marathon-test-api"
>               },
>               forcePullImage: yes
>       },
>       cpus: 0.1,
>       mem: 256,
>       instances: 1,
>       id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
