[ 
https://issues.apache.org/jira/browse/MESOS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437270#comment-16437270
 ] 

Karsten commented on MESOS-8762:
--------------------------------

Alright, I was able to strace the leaking process. See {{zombie-process.trace}} 
for the strace output and {{zombie-process-sandbox.zip}} for the Mesos sandbox 
of the test run. 
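
For reference, the quickest way to confirm the stuck state without wading through the 
whole trace is to look at {{/proc}} directly. A minimal sketch (the PID is the leaked 
process from the CI output below; reading {{/proc/<pid>/stack}} usually needs root):

{code}
# Hedged sketch: confirm that a PID is stuck in uninterruptible sleep ("D")
# and see where it is blocked in the kernel. PID 18084 is the leaked python
# process from the CI output further down.
PID = 18084

with open("/proc/%d/status" % PID) as f:
    for line in f:
        if line.startswith("State:"):
            print(line.strip())  # e.g. "State:  D (disk sleep)"

# The kernel call chain the process is blocked in; strace cannot show this
# because the process never returns to user space while in "D".
try:
    with open("/proc/%d/stack" % PID) as f:
        print(f.read())
except PermissionError:
    print("run as root to read /proc/<pid>/stack")
{code}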

> Framework Teardown Leaves Task in Uninterruptible Sleep State D
> ---------------------------------------------------------------
>
>                 Key: MESOS-8762
>                 URL: https://issues.apache.org/jira/browse/MESOS-8762
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Karsten
>            Assignee: Till Toenshoff
>            Priority: Major
>         Attachments: UpgradeIntegrationTest.zip, happy-process-sandbox.zip, 
> happy-process.trace, master_agent.log, zombi-process.trace, 
> zombie-process-sandbox.zip
>
>
> The Marathon test suite starts a simple Python HTTP server in a task group (aka 
> pod in Marathon) with a persistent volume. After the test run we call 
> {{/teardown}} and wait for the Marathon framework to be reported as completed 
> (see 
> [MesosTest|https://github.com/mesosphere/marathon/blob/master/src/test/scala/mesosphere/marathon/integration/setup/MesosTest.scala#L311]).
>  
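> For reference, the teardown call the test suite makes boils down to a POST against 
> the master's teardown endpoint. A minimal sketch (the master address below is a 
> placeholder; the framework ID is the one from the attached logs):
> {code}
> # Sketch of the framework teardown request; MASTER is a placeholder, not the
> # address used in this test run.
> import urllib.request
>
> MASTER = "http://127.0.0.1:5050"
> FRAMEWORK_ID = "3c1bf149-6b68-469c-beb9-9910f386fd5a-0000"  # from the master log
>
> data = ("frameworkId=" + FRAMEWORK_ID).encode("ascii")
> req = urllib.request.Request(MASTER + "/master/teardown", data=data)
> with urllib.request.urlopen(req) as resp:
>     print(resp.status)  # 200 means the master accepted the teardown
> {code}
>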
> Our CI checks whether we leak any tasks after all test runs. It turns out we 
> do:
> {code}
> Will kill:
>   root     18084  0.0  0.0  45380 13612 ?        D    07:52   0:00 python 
> src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z 
> http://www.example.com
> Running 'sudo kill -9 18084
> Wait for processes being killed...
> ...
> Couldn't kill some leaked processes:
>   root     18084  0.0  0.0  45380 13612 ?        D    07:52   0:00 python 
> src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z 
> http://www.example.com
> ammonite.$file.ci.utils$StageException: Stage Compile and Test failed.
> {code}
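> The leak check itself is roughly the following; a simplified sketch, not the actual 
> CI code. It scans {{/proc}} for leftover {{app_mock.py}} processes, sends 
> {{SIGKILL}}, and reports survivors, which is exactly what a task stuck in state 
> {{D}} produces:
> {code}
> # Simplified leak check: find leftover app_mock.py processes, send SIGKILL,
> # and report survivors. A process in uninterruptible sleep ("D") does not act
> # on any signal, SIGKILL included, until its kernel wait finishes.
> import os
> import signal
> import time
>
> def leaked_pids(pattern="app_mock.py"):
>     pids = []
>     for entry in os.listdir("/proc"):
>         if not entry.isdigit():
>             continue
>         try:
>             with open("/proc/%s/cmdline" % entry, "rb") as f:
>                 cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
>         except OSError:
>             continue
>         if pattern in cmdline:
>             pids.append(int(entry))
>     return pids
>
> for pid in leaked_pids():
>     try:
>         os.kill(pid, signal.SIGKILL)  # needs root; the leaked task runs as root
>     except ProcessLookupError:
>         pass
> time.sleep(5)
> survivors = leaked_pids()
> if survivors:
>     raise RuntimeError("Couldn't kill some leaked processes: %s" % survivors)
> {code}
>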
> The attached Mesos master and agent logs ({{master_agent.log}}) show
> {code}
> Updating the state of task 
> resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 
> of framework 3c1bf149-6b68-469c-beb9-9910f386fd5a-0000 (latest state: 
> TASK_KILLED, status update state: TASK_KILLED)
> {code}
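> So the master already considers the task terminal while the process is still visible 
> in {{ps}}. A quick way to cross-check that discrepancy (sketch only; the master 
> address is a placeholder and the task ID is taken from the log line above):
> {code}
> # Sketch: ask the master what it believes about the task and compare it with
> # the process list on the agent. MASTER is a placeholder address.
> import json
> import urllib.request
>
> MASTER = "http://127.0.0.1:5050"
> TASK = ("resident-pod-16322-fail.instance-"
>         "6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2")
>
> with urllib.request.urlopen(MASTER + "/master/tasks") as resp:
>     tasks = json.load(resp)["tasks"]
>
> for task in tasks:
>     if task["id"] == TASK:
>         # The master reports TASK_KILLED here, yet the python process is
>         # still present, stuck in state "D", on the agent.
>         print(task["state"])
> {code}
>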
> The executor logs (see the zipped sandbox attached) show
> {code}
> I0406 07:52:39.925599 18078 default_executor.cpp:191] Received SHUTDOWN event
> I0406 07:52:39.925624 18078 default_executor.cpp:962] Shutting down
> I0406 07:52:39.925634 18078 default_executor.cpp:1058] Killing task 
> resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 
> running in child container 
> cfce1f46-f565-4f45-aa0f-e6e6f6c5434b.0cc9c336-f0b4-448d-a8b4-dea33c02f0ae 
> with SIGTERM signal
> I0406 07:52:39.925647 18078 default_executor.cpp:1080] Scheduling escalation 
> to SIGKILL in 3secs from now
> {code}
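> The escalation logged here is the usual SIGTERM-then-SIGKILL pattern. An 
> illustrative sketch of that pattern in Python (not the executor's actual C++ 
> implementation) would be:
> {code}
> # Illustration of kill escalation: SIGTERM first, SIGKILL after a grace
> # period. A process in uninterruptible sleep never handles the SIGTERM, and
> # the SIGKILL only takes effect once its kernel wait ends, which is the
> # symptom reported in this ticket.
> import os
> import signal
> import time
>
> def kill_with_escalation(pid, grace=3.0):
>     os.kill(pid, signal.SIGTERM)
>     deadline = time.time() + grace
>     while time.time() < deadline:
>         try:
>             os.kill(pid, 0)  # probe: is the process still there?
>         except ProcessLookupError:
>             return True      # it exited within the grace period
>         time.sleep(0.1)
>     try:
>         os.kill(pid, signal.SIGKILL)  # escalate after the grace period
>     except ProcessLookupError:
>         return True
>     return False
> {code}
>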
> The task log shows
> {code}
> 2018-04-06 07:52:36,620 INFO    : 2.7.13
> 2018-04-06 07:52:36,621 DEBUG   : ['src/app_mock.py', '35477', 
> 'resident-pod-16322-fail', '2018-04-06T07:52:16.924Z', 
> 'http://www.example.com']
> 2018-04-06 07:52:36,621 INFO    : AppMock[resident-pod-16322-fail 
> 2018-04-06T07:52:16.924Z]: 
> resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 
> has taken the stage at port 35477. Will query http://www.example.com for 
> health and readiness status.
> 2018-04-06 07:52:38,895 DEBUG   : Got GET request
> 172.16.10.198 - - [06/Apr/2018 07:52:38] "GET /pst1/foo HTTP/1.1" 200 -
> {code}
> Compare these to a single task without a persistent volume from the same run, whose 
> log does show the {{SIGTERM}} arriving:
> {code}
> I0406 07:52:39.925590 15585 exec.cpp:445] Executor asked to shutdown
> I0406 07:52:39.925698 15585 executor.cpp:171] Received SHUTDOWN event
> I0406 07:52:39.925717 15585 executor.cpp:748] Shutting down
> I0406 07:52:39.925729 15585 executor.cpp:863] Sending SIGTERM to process tree 
> at pid 15591
> I0406 07:52:40.029878 15585 executor.cpp:876] Sent SIGTERM to the following 
> process trees:
> [ 
> -+- 15591 sh -c echo APP PROXY $MESOS_TASK_ID RUNNING; 
> /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py
>  $PORT0 /app-156 v1 http://$HOST:0/%2Fapp-156/v1 
>  \--- 15612 python 
> /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py
>  35417 /app-156 v1 http://172.16.10.198:0/%2Fapp-156/v1 
> ]
> I0406 07:52:40.029912 15585 executor.cpp:880] Scheduling escalation to 
> SIGKILL in 3secs from now
> 2018-04-06 07:52:40,029 WARNING : Received 15 signal. Closing the server...
> 2018-04-06 07:52:40,030 ERROR   : Socket.error in the main thread: 
> Traceback (most recent call last):
>   File 
> "/home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py",
>  line 151, in <module>
>     httpd.serve_forever()
>   File "/usr/lib/python2.7/SocketServer.py", line 231, in serve_forever
>     poll_interval)
>   File "/usr/lib/python2.7/SocketServer.py", line 150, in _eintr_retry
>     return func(*args)
>   File "/usr/lib/python2.7/SocketServer.py", line 456, in fileno
>     return self.socket.fileno()
>   File "/usr/lib/python2.7/socket.py", line 228, in meth
>     return getattr(self._sock,name)(*args)
>   File "/usr/lib/python2.7/socket.py", line 174, in _dummy
>     raise error(EBADF, 'Bad file descriptor')
> error: [Errno 9] Bad file descriptor
> 2018-04-06 07:52:40,030 INFO    : Closing the server...
> {code}
> That server receives the {{SIGTERM}} and shuts down, unlike the pod task above, 
> which is left behind in uninterruptible sleep.
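>
> The {{EBADF}} traceback above is a side effect of the app's signal handler closing 
> the listening socket while {{serve_forever()}} is still polling it (an assumption 
> based on the traceback; {{app_mock.py}} runs on Python 2.7 {{SocketServer}}). A 
> Python 3 sketch of the same shutdown flow that avoids the {{EBADF}} by calling 
> {{shutdown()}} from a separate thread:
> {code}
> # Sketch of a SIGTERM-aware HTTP server (Python 3). The port is the one the
> # pod task used in the log above; everything else is illustrative.
> import signal
> import threading
> from http.server import BaseHTTPRequestHandler, HTTPServer
>
> class Handler(BaseHTTPRequestHandler):
>     def do_GET(self):
>         self.send_response(200)
>         self.end_headers()
>         self.wfile.write(b"ok\n")
>
> httpd = HTTPServer(("", 35477), Handler)
>
> def on_sigterm(signum, frame):
>     # shutdown() blocks until serve_forever() returns, so it must not run on
>     # the serving thread; hand it off instead of closing the socket directly,
>     # which is what produces the EBADF seen above.
>     threading.Thread(target=httpd.shutdown).start()
>
> signal.signal(signal.SIGTERM, on_sigterm)
> httpd.serve_forever()
> httpd.server_close()
> {code}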



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
