[ https://issues.apache.org/jira/browse/MESOS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Toenshoff reassigned MESOS-8762: ------------------------------------- Assignee: (was: Till Toenshoff) > Farmework Teardown Leaves Task in Uninterruptible Sleep State D > --------------------------------------------------------------- > > Key: MESOS-8762 > URL: https://issues.apache.org/jira/browse/MESOS-8762 > Project: Mesos > Issue Type: Bug > Reporter: Karsten > Priority: Major > Attachments: UpgradeIntegrationTest.zip, happy-process-sandbox.zip, > happy-process.trace, master_agent.log, zombi-process.trace, > zombie-process-sandbox.zip > > > The Marathon has a testsuite that starts a Python simple HTTP server in a > task group aka pod in Marathon wit ha persistent volume. After the test run > we call {{/teardown}} and wait for the Marathon framework to be completed > (see > [MesosTest|https://github.com/mesosphere/marathon/blob/master/src/test/scala/mesosphere/marathon/integration/setup/MesosTest.scala#L311]). > > Our CI checks whether we leak any tasks after all test runs. It turns out we > do: > {code} > Will kill: > root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python > src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z > http://www.example.com > Running 'sudo kill -9 18084 > Wait for processes being killed... > ... > Couldn't kill some leaked processes: > root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python > src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z > http://www.example.com > ammonite.$file.ci.utils$StageException: Stage Compile and Test failed. > {code} > The attached Mesos master and agents logs (see attachment) show > {code} > Updating the state of task > resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 > of framework 3c1bf149-6b68-469c-beb9-9910f386fd5a-0000 (latest state: > TASK_KILLED, status update state: TASK_KILLED) > {code} > The executor logs (see zipped sandbox attached) shows > {code} > I0406 07:52:39.925599 18078 default_executor.cpp:191] Received SHUTDOWN event > I0406 07:52:39.925624 18078 default_executor.cpp:962] Shutting down > I0406 07:52:39.925634 18078 default_executor.cpp:1058] Killing task > resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 > running in child container > cfce1f46-f565-4f45-aa0f-e6e6f6c5434b.0cc9c336-f0b4-448d-a8b4-dea33c02f0ae > with SIGTERM signal > I0406 07:52:39.925647 18078 default_executor.cpp:1080] Scheduling escalation > to SIGKILL in 3secs from now > {code} > The task logs > {code} > 2018-04-06 07:52:36,620 INFO : 2.7.13 > 2018-04-06 07:52:36,621 DEBUG : ['src/app_mock.py', '35477', > 'resident-pod-16322-fail', '2018-04-06T07:52:16.924Z', > 'http://www.example.com'] > 2018-04-06 07:52:36,621 INFO : AppMock[resident-pod-16322-fail > 2018-04-06T07:52:16.924Z]: > resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 > has taken the stage at port 35477. Will query http://www.example.com for > health and readiness status. > 2018-04-06 07:52:38,895 DEBUG : Got GET request > 172.16.10.198 - - [06/Apr/2018 07:52:38] "GET /pst1/foo HTTP/1.1" 200 - > {code} > Compare these to a single task without a persistent volume of the same run > {code} > I0406 07:52:39.925590 15585 exec.cpp:445] Executor asked to shutdown > I0406 07:52:39.925698 15585 executor.cpp:171] Received SHUTDOWN event > I0406 07:52:39.925717 15585 executor.cpp:748] Shutting down > I0406 07:52:39.925729 15585 executor.cpp:863] Sending SIGTERM to process tree > at pid 15591 > I0406 07:52:40.029878 15585 executor.cpp:876] Sent SIGTERM to the following > process trees: > [ > -+- 15591 sh -c echo APP PROXY $MESOS_TASK_ID RUNNING; > /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py > $PORT0 /app-156 v1 http://$HOST:0/%2Fapp-156/v1 > \--- 15612 python > /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py > 35417 /app-156 v1 http://172.16.10.198:0/%2Fapp-156/v1 > ] > I0406 07:52:40.029912 15585 executor.cpp:880] Scheduling escalation to > SIGKILL in 3secs from now > 2018-04-06 07:52:40,029 WARNING : Received 15 signal. Closing the server... > 2018-04-06 07:52:40,030 ERROR : Socket.error in the main thread: > Traceback (most recent call last): > File > "/home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py", > line 151, in <module> > httpd.serve_forever() > File "/usr/lib/python2.7/SocketServer.py", line 231, in serve_forever > poll_interval) > File "/usr/lib/python2.7/SocketServer.py", line 150, in _eintr_retry > return func(*args) > File "/usr/lib/python2.7/SocketServer.py", line 456, in fileno > return self.socket.fileno() > File "/usr/lib/python2.7/socket.py", line 228, in meth > return getattr(self._sock,name)(*args) > File "/usr/lib/python2.7/socket.py", line 174, in _dummy > raise error(EBADF, 'Bad file descriptor') > error: [Errno 9] Bad file descriptor > 2018-04-06 07:52:40,030 INFO : Closing the server... > {code} > The server receives the {{SIGTERM}} and shuts down. -- This message was sent by Atlassian JIRA (v7.6.3#76005)