[
https://issues.apache.org/jira/browse/MESOS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445755#comment-16445755
]
Karsten edited comment on MESOS-8762 at 4/20/18 1:59 PM:
---------------------------------------------------------
Take a look at this build
https://jenkins.mesosphere.com/service/jenkins/view/Marathon/job/marathon-sandbox/job/marathon-loop-karsten-slim/142/.
First, check the console output. You'll see two processes in state D:
{code}
root 12440 0.2 0.0 45380 13488 ? D 02:48 0:00 python
src/app_mock.py 33801 resident-pod-16322 2018-04-20T02:48:35.397Z
http://www.example.com
root 12601 0.5 0.0 45380 13352 ? D 02:48 0:00 python
src/app_mock.py 33793 resident-pod-16322-fail 2018-04-20T02:48:37.678Z
http://www.example.com
{code}
Now, download the logs and search for {{Process tree before teardown}}. You'll
find all processes before we trigger the framework teardown on Mesos. You'll
find PIDs {{12440}} and {{12601}} as children of {{systemd 1}}, *not* as
children of {{mesos-agent,11783}} as I would expect. The same is true for the
tree from {{ps auxf}} in the console logs. These are *after* all tests ran.
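For reference, D-state processes can be picked out of {{ps aux}} output mechanically. A minimal sketch (the two sample lines are the ones from the build output above, joined onto single lines):

```python
# Sketch: filter `ps aux`-style output for processes in uninterruptible
# sleep (the STAT column contains "D"). The sample lines are taken from
# the build output quoted above.
PS_OUTPUT = """\
root 12440 0.2 0.0 45380 13488 ? D 02:48 0:00 python src/app_mock.py 33801 resident-pod-16322 2018-04-20T02:48:35.397Z http://www.example.com
root 12601 0.5 0.0 45380 13352 ? D 02:48 0:00 python src/app_mock.py 33793 resident-pod-16322-fail 2018-04-20T02:48:37.678Z http://www.example.com
"""

def d_state_pids(ps_output):
    """Return the PIDs of all processes whose STAT column contains 'D'."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) > 7 and "D" in fields[7]:  # ps aux: STAT is column 8
            pids.append(int(fields[1]))           # ps aux: PID is column 2
    return pids

print(d_state_pids(PS_OUTPUT))  # → [12440, 12601]
```

Both leaked PIDs show up, matching the console output above.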
> Framework Teardown Leaves Task in Uninterruptible Sleep State D
> ---------------------------------------------------------------
>
> Key: MESOS-8762
> URL: https://issues.apache.org/jira/browse/MESOS-8762
> Project: Mesos
> Issue Type: Bug
> Reporter: Karsten
> Assignee: Till Toenshoff
> Priority: Major
> Attachments: UpgradeIntegrationTest.zip, happy-process-sandbox.zip,
> happy-process.trace, master_agent.log, zombi-process.trace,
> zombie-process-sandbox.zip
>
>
> Marathon has a test suite that starts a simple Python HTTP server in a task
> group (aka pod in Marathon) with a persistent volume. After the test run
> we call {{/teardown}} and wait for the Marathon framework to complete
> (see
> [MesosTest|https://github.com/mesosphere/marathon/blob/master/src/test/scala/mesosphere/marathon/integration/setup/MesosTest.scala#L311]).
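The teardown step boils down to a single POST against the master. A sketch of that request, assuming the standard Mesos {{/master/teardown}} endpoint with a {{frameworkId}} form field (the framework ID here is the one from the logs below; the master address is a placeholder):

```python
# Sketch of the framework teardown call the test harness performs.
# The master address is an assumption for illustration; the endpoint and
# form field follow Mesos' /master/teardown HTTP API.
from urllib.request import Request

def teardown_request(master, framework_id):
    """Build the POST request that asks the master to tear down a framework."""
    return Request(
        url="http://%s/master/teardown" % master,
        data=("frameworkId=%s" % framework_id).encode("ascii"),
        method="POST",
    )

req = teardown_request("localhost:5050",
                       "3c1bf149-6b68-469c-beb9-9910f386fd5a-0000")
print(req.full_url)       # → http://localhost:5050/master/teardown
print(req.data.decode())  # → frameworkId=3c1bf149-6b68-469c-beb9-9910f386fd5a-0000
```

Once the master acknowledges this, it shuts down the framework's executors, which is where the SHUTDOWN events in the executor logs below come from.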
>
> Our CI checks whether we leak any tasks after all test runs. It turns out we
> do:
> {code}
> Will kill:
> root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python
> src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z
> http://www.example.com
> Running 'sudo kill -9 18084
> Wait for processes being killed...
> ...
> Couldn't kill some leaked processes:
> root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python
> src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z
> http://www.example.com
> ammonite.$file.ci.utils$StageException: Stage Compile and Test failed.
> {code}
> The attached Mesos master and agent logs show
> {code}
> Updating the state of task
> resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2
> of framework 3c1bf149-6b68-469c-beb9-9910f386fd5a-0000 (latest state:
> TASK_KILLED, status update state: TASK_KILLED)
> {code}
> The executor logs (see the zipped sandbox attached) show
> {code}
> I0406 07:52:39.925599 18078 default_executor.cpp:191] Received SHUTDOWN event
> I0406 07:52:39.925624 18078 default_executor.cpp:962] Shutting down
> I0406 07:52:39.925634 18078 default_executor.cpp:1058] Killing task
> resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2
> running in child container
> cfce1f46-f565-4f45-aa0f-e6e6f6c5434b.0cc9c336-f0b4-448d-a8b4-dea33c02f0ae
> with SIGTERM signal
> I0406 07:52:39.925647 18078 default_executor.cpp:1080] Scheduling escalation
> to SIGKILL in 3secs from now
> {code}
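The SIGTERM-then-SIGKILL escalation the executor logs describe can be sketched as follows (grace period shortened from 3s for the demo; the child here deliberately ignores SIGTERM so the escalation must fall through to SIGKILL):

```python
# Sketch of the default executor's kill escalation: SIGTERM first, then
# SIGKILL after a grace period if the process is still alive.
import signal, subprocess, sys, time

def kill_with_escalation(proc, grace=0.5):
    """Send SIGTERM; if the process survives the grace period, SIGKILL it."""
    proc.terminate()                    # SIGTERM
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()                     # escalate to SIGKILL
        return proc.wait()

# A child that ignores SIGTERM, forcing the escalation path.
child = subprocess.Popen([
    sys.executable, "-c",
    "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)",
])
time.sleep(0.3)  # give the child time to install its SIGTERM handler
print(kill_with_escalation(child))  # → -9 (killed by SIGKILL)
```

Note that a process stuck in state D ignores even the SIGKILL half of this escalation, which is why the leaked tasks above survive the CI's {{kill -9}}.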
> The task logs show
> {code}
> 2018-04-06 07:52:36,620 INFO : 2.7.13
> 2018-04-06 07:52:36,621 DEBUG : ['src/app_mock.py', '35477',
> 'resident-pod-16322-fail', '2018-04-06T07:52:16.924Z',
> 'http://www.example.com']
> 2018-04-06 07:52:36,621 INFO : AppMock[resident-pod-16322-fail
> 2018-04-06T07:52:16.924Z]:
> resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2
> has taken the stage at port 35477. Will query http://www.example.com for
> health and readiness status.
> 2018-04-06 07:52:38,895 DEBUG : Got GET request
> 172.16.10.198 - - [06/Apr/2018 07:52:38] "GET /pst1/foo HTTP/1.1" 200 -
> {code}
> Compare these to a single task without a persistent volume from the same run
> {code}
> I0406 07:52:39.925590 15585 exec.cpp:445] Executor asked to shutdown
> I0406 07:52:39.925698 15585 executor.cpp:171] Received SHUTDOWN event
> I0406 07:52:39.925717 15585 executor.cpp:748] Shutting down
> I0406 07:52:39.925729 15585 executor.cpp:863] Sending SIGTERM to process tree
> at pid 15591
> I0406 07:52:40.029878 15585 executor.cpp:876] Sent SIGTERM to the following
> process trees:
> [
> -+- 15591 sh -c echo APP PROXY $MESOS_TASK_ID RUNNING;
> /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py
> $PORT0 /app-156 v1 http://$HOST:0/%2Fapp-156/v1
> \--- 15612 python
> /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py
> 35417 /app-156 v1 http://172.16.10.198:0/%2Fapp-156/v1
> ]
> I0406 07:52:40.029912 15585 executor.cpp:880] Scheduling escalation to
> SIGKILL in 3secs from now
> 2018-04-06 07:52:40,029 WARNING : Received 15 signal. Closing the server...
> 2018-04-06 07:52:40,030 ERROR : Socket.error in the main thread:
> Traceback (most recent call last):
> File
> "/home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py",
> line 151, in <module>
> httpd.serve_forever()
> File "/usr/lib/python2.7/SocketServer.py", line 231, in serve_forever
> poll_interval)
> File "/usr/lib/python2.7/SocketServer.py", line 150, in _eintr_retry
> return func(*args)
> File "/usr/lib/python2.7/SocketServer.py", line 456, in fileno
> return self.socket.fileno()
> File "/usr/lib/python2.7/socket.py", line 228, in meth
> return getattr(self._sock,name)(*args)
> File "/usr/lib/python2.7/socket.py", line 174, in _dummy
> raise error(EBADF, 'Bad file descriptor')
> error: [Errno 9] Bad file descriptor
> 2018-04-06 07:52:40,030 INFO : Closing the server...
> {code}
> The server receives the {{SIGTERM}} and shuts down.
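The source of app_mock.py is not shown here, but the "Received 15 signal" log line and the EBADF traceback are consistent with a SIGTERM handler that closes the server socket out from under {{serve_forever()}}. A sketch of that pattern (an assumption about app_mock.py's internals, not its actual source):

```python
# Sketch of the shutdown pattern the traceback above suggests: a SIGTERM
# handler that closes the listening socket while serve_forever() is still
# polling it. This is an illustration, not app_mock.py's actual code.
import os, signal
from http.server import HTTPServer, BaseHTTPRequestHandler

httpd = HTTPServer(("127.0.0.1", 0), BaseHTTPRequestHandler)
closed = []

def on_sigterm(signum, frame):
    # Mirrors the "Received 15 signal. Closing the server..." log line above.
    httpd.server_close()  # invalidates the fd serve_forever() is polling
    closed.append(signum)

signal.signal(signal.SIGTERM, on_sigterm)
os.kill(os.getpid(), signal.SIGTERM)  # stand-in for the executor's SIGTERM
print(closed)  # → [15]
```

In the Python 2.7 runtime the task used (per the {{/usr/lib/python2.7}} paths in the traceback), the still-running {{serve_forever()}} loop then trips over the closed socket, producing exactly the {{[Errno 9] Bad file descriptor}} seen above.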
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)