[
https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196097#comment-16196097
]
Qian Zhang commented on MESOS-8051:
-----------------------------------
I can reproduce this issue with the latest Mesos code (master branch) +
Marathon v1.5.1. I created the same task group (pod) as [~zen-dog]'s, and once
it was running, I ran the following command to delete it:
{code}
curl -X DELETE http://192.168.1.3:8080/v2/pods/simple-pod
{code}
Then I can see one task in {{TASK_KILLED}} status, which is expected:
{code}
I1008 19:58:59.928489 20665 slave.cpp:4411] Handling status update TASK_KILLED (UUID: e4b6562b-629a-4da8-85b8-ec7a57d1157e) for task simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2 of framework 206eabc8-a6ba-430f-8ecb-b0900b26b820-0000
{code}
But the other task is in {{TASK_FAILED}} status:
{code}
I1008 19:59:00.413954 20665 slave.cpp:4411] Handling status update TASK_FAILED (UUID: 65467177-066c-4a5d-b04a-f35c66a6d89c) for task simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1 of framework 206eabc8-a6ba-430f-8ecb-b0900b26b820-0000 from @0.0.0.0:0
{code}
And I see the executor core dumped (I found the same core dump in
[~zen-dog]'s agent log as well):
{code}
I1008 19:59:00.412891 20665 slave.cpp:5424] Executor 'instance-simple-pod.c034f882-ac1f-11e7-bc9c-024255337c79' of framework 206eabc8-a6ba-430f-8ecb-b0900b26b820-0000 terminated with signal Aborted (core dumped)
{code}
Then, in the {{stderr}} of the executor, I found the root cause of the core
dump:
{code}
# cat /opt/mesos/slaves/097ad1b0-4a33-400d-9d06-b554f9c7c009-S0/frameworks/206eabc8-a6ba-430f-8ecb-b0900b26b820-0000/executors/instance-simple-pod.c034f882-ac1f-11e7-bc9c-024255337c79/runs/019b7103-b5e4-4fa9-aaa6-e4b092167000/stderr
I1008 19:56:43.578035 20917 executor.cpp:192] Version: 1.5.0
I1008 19:56:43.612187 20938 default_executor.cpp:185] Received SUBSCRIBED event
I1008 19:56:43.614189 20938 default_executor.cpp:189] Subscribed executor on workstation
I1008 19:56:43.619079 20932 default_executor.cpp:185] Received LAUNCH_GROUP event
I1008 19:56:43.621937 20931 default_executor.cpp:394] Setting 'MESOS_CONTAINER_IP' to: 192.168.1.3
I1008 19:56:43.703390 20933 default_executor.cpp:624] Successfully launched tasks [ simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1, simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2 ] in child containers [ 019b7103-b5e4-4fa9-aaa6-e4b092167000.3d104d9e-9e66-4a55-b067-ec58e782c62a, 019b7103-b5e4-4fa9-aaa6-e4b092167000.6f655d39-7a60-4fb5-84d3-5c9ef61070e8 ]
I1008 19:56:43.707593 20936 default_executor.cpp:697] Waiting for child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.3d104d9e-9e66-4a55-b067-ec58e782c62a of task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1'
I1008 19:56:43.707864 20936 default_executor.cpp:697] Waiting for child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.6f655d39-7a60-4fb5-84d3-5c9ef61070e8 of task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2'
I1008 19:56:43.730860 20938 default_executor.cpp:185] Received ACKNOWLEDGED event
I1008 19:56:43.762722 20937 default_executor.cpp:185] Received ACKNOWLEDGED event
I1008 19:58:59.719534 20934 default_executor.cpp:185] Received KILL event
I1008 19:58:59.719614 20934 default_executor.cpp:1119] Received kill for task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1'
I1008 19:58:59.719633 20934 default_executor.cpp:1004] Killing task simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1 running in child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.3d104d9e-9e66-4a55-b067-ec58e782c62a with SIGTERM signal
I1008 19:58:59.719643 20934 default_executor.cpp:1026] Scheduling escalation to SIGKILL in 3secs from now
I1008 19:58:59.723572 20931 default_executor.cpp:185] Received KILL event
I1008 19:58:59.723633 20931 default_executor.cpp:1119] Received kill for task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2'
I1008 19:58:59.723646 20931 default_executor.cpp:1004] Killing task simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2 running in child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.6f655d39-7a60-4fb5-84d3-5c9ef61070e8 with SIGTERM signal
I1008 19:58:59.723654 20931 default_executor.cpp:1026] Scheduling escalation to SIGKILL in 3secs from now
I1008 19:58:59.896623 20936 default_executor.cpp:842] Child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.6f655d39-7a60-4fb5-84d3-5c9ef61070e8 of task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2' in state TASK_KILLED terminated with signal Terminated
I1008 19:58:59.896730 20936 default_executor.cpp:879] Killing task group containing tasks [ simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1, simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2 ]
F1008 19:58:59.896762 20936 default_executor.cpp:979] Check failed: !container->killing
*** Check failure stack trace: ***
@ 0x7f9573e32c86 google::LogMessage::Fail()
@ 0x7f9573e32bce google::LogMessage::SendToLog()
@ 0x7f9573e325a4 google::LogMessage::Flush()
@ 0x7f9573e35786 google::LogMessageFatal::~LogMessageFatal()
@ 0x55ad58dcd47d mesos::internal::DefaultExecutor::kill()
@ 0x55ad58dcc56d mesos::internal::DefaultExecutor::waited()
@ 0x55ad58df3f9e _ZZN7process8dispatchIN5mesos8internal15DefaultExecutorERKN2id4UUIDERKNS1_6TaskIDERKNS_6FutureINS_4http8ResponseEEES7_SA_SG_EEvRKNS_3PIDIT_EEMSI_FvT0_T1_T2_EOT3_OT4_OT5_ENKUlRS5_RS8_RSE_PNS_11ProcessBaseEE_clESX_SY_SZ_S11_
@ 0x55ad58e39489 _ZNSt5_BindIFZN7process8dispatchIN5mesos8internal15DefaultExecutorERKN2id4UUIDERKNS2_6TaskIDERKNS0_6FutureINS0_4http8ResponseEEES8_SB_SH_EEvRKNS0_3PIDIT_EEMSJ_FvT0_T1_T2_EOT3_OT4_OT5_EUlRS6_RS9_RSF_PNS0_11ProcessBaseEE_S6_S9_SF_St12_PlaceholderILi1EEEE6__callIvJOS12_EJLm0ELm1ELm2ELm3EEEESJ_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
@ 0x55ad58e2d6aa _ZNSt5_BindIFZN7process8dispatchIN5mesos8internal15DefaultExecutorERKN2id4UUIDERKNS2_6TaskIDERKNS0_6FutureINS0_4http8ResponseEEES8_SB_SH_EEvRKNS0_3PIDIT_EEMSJ_FvT0_T1_T2_EOT3_OT4_OT5_EUlRS6_RS9_RSF_PNS0_11ProcessBaseEE_S6_S9_SF_St12_PlaceholderILi1EEEEclIJS12_EvEESN_DpOT_
@ 0x55ad58e1d734 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEESt5_BindIFZNS0_8dispatchIN5mesos8internal15DefaultExecutorERKN2id4UUIDERKNS6_6TaskIDERKNS0_6FutureINS0_4http8ResponseEEESC_SF_SL_EEvRKNS0_3PIDIT_EEMSN_FvT0_T1_T2_EOT3_OT4_OT5_EUlRSA_RSD_RSJ_S2_E_SA_SD_SJ_St12_PlaceholderILi1EEEEE9_M_invokeERKSt9_Any_dataOS2_
@ 0x7f9573d5be2b std::function<>::operator()()
@ 0x7f9573d3a95e process::ProcessBase::visit()
@ 0x7f9573d47284 process::DispatchEvent::visit()
@ 0x55ad58dbb122 process::ProcessBase::serve()
@ 0x7f9573d37b2e process::ProcessManager::resume()
@ 0x7f9573d33eda _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
@ 0x7f9573d46202 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f9573d4618c _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv
@ 0x7f9573d4615c _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7f956cfcc83f (unknown)
@ 0x7f956c7da6da start_thread
@ 0x7f956c514d7f (unknown)
{code}
As the above log shows, when we delete the pod via Marathon, Marathon issues
two kill task requests to Mesos at the same time, so the default executor
receives two {{KILL}} events, one for each task. When one of the tasks
terminates, {{DefaultExecutor::waited()}} calls {{DefaultExecutor::kill()}} to
kill the other task in the group, and that method contains a check:
{code}
CHECK(!container->killing);
{code}
This check fails because the other container is already being killed (its kill
was initiated by the second {{KILL}} event), and that is why the default
executor core dumped. The sketch below illustrates the race.
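To make the sequence concrete, here is a minimal, self-contained C++ sketch of
the race. It is not the actual Mesos source: {{Container}}, {{kill()}} and
{{waited()}} below are simplified stand-ins for the corresponding pieces of
{{default_executor.cpp}}, and the assertion plays the role of the failing
{{CHECK}}.
{code}
// Simplified stand-in for the default executor's per-container state and the
// kill()/waited() interplay. Names are illustrative, not the real Mesos API.
#include <cassert>
#include <iostream>
#include <map>
#include <string>

struct Container
{
  std::string taskId;
  bool killing = false;     // Set once a kill has been initiated.
  bool terminated = false;  // Set once the container has exited.
};

std::map<std::string, Container> containers;

// Analogous to DefaultExecutor::kill(): asserts that the container is not
// already being killed -- this assertion stands in for the failing CHECK.
void kill(Container& container)
{
  assert(!container.killing && "Check failed: !container->killing");
  container.killing = true;
  std::cout << "Sending SIGTERM to task " << container.taskId << std::endl;
}

// Analogous to DefaultExecutor::waited(): when one container terminates, the
// executor kills the rest of the task group, including containers whose kill
// was already requested by the scheduler.
void waited(Container& exited)
{
  exited.terminated = true;
  std::cout << "Task " << exited.taskId
            << " terminated; killing the task group" << std::endl;

  for (auto& entry : containers) {
    Container& container = entry.second;
    if (!container.terminated) {
      kill(container);  // Aborts here if container.killing is already true.
    }
  }
}

int main()
{
  containers["ct1"].taskId = "ct1";
  containers["ct2"].taskId = "ct2";

  // Marathon deletes the pod: two KILL events arrive, one per task.
  kill(containers["ct1"]);
  kill(containers["ct2"]);

  // ct2 exits first; waited() then tries to kill ct1, which is already
  // marked as killing, so the assertion (the CHECK in the real executor) fails.
  waited(containers["ct2"]);

  return 0;
}
{code}
Compiling and running this sketch (e.g. {{g++ race.cpp && ./a.out}}) aborts on
the assertion after the first container terminates, which mirrors the
{{Aborted (core dumped)}} termination seen in the agent log.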
> Killing TASK_GROUP fail to kill some tasks
> ------------------------------------------
>
> Key: MESOS-8051
> URL: https://issues.apache.org/jira/browse/MESOS-8051
> Project: Mesos
> Issue Type: Bug
> Components: agent, executor
> Affects Versions: 1.4.0
> Reporter: A. Dukhovniy
> Assignee: Qian Zhang
> Priority: Critical
> Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz,
> screenshot-1.png
>
>
> When starting the following pod definition via Marathon:
> {code:java}
> {
>   "id": "/simple-pod",
>   "scaling": {
>     "kind": "fixed",
>     "instances": 3
>   },
>   "environment": {
>     "PING": "PONG"
>   },
>   "containers": [
>     {
>       "name": "ct1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "image": {
>         "kind": "MESOS",
>         "id": "busybox"
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "test-v1"
>         }
>       ]
>     },
>     {
>       "name": "ct2",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "etc"
>         },
>         {
>           "name": "v2",
>           "mountPath": "docker"
>         }
>       ]
>     }
>   ],
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [
>     {
>       "name": "v1"
>     },
>     {
>       "name": "v2",
>       "host": "/var/lib/docker"
>     }
>   ]
> }
> {code}
> Mesos will successfully kill all {{ct2}} containers but fails to kill some or
> all of the {{ct1}} containers. I've attached both master and agent logs. The
> interesting part starts after Marathon issues 6 kills:
> {code:java}
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210932 4753 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210968 4753 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211210 4747 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211251 4747 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211474 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211514 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at [email protected]:15101
> {code}
> All {{.ct1}} tasks eventually fail (after ~30s), while the {{.ct2}} tasks are
> successfully killed.