[jira] [Commented] (MESOS-8051) Killing TASK_GROUP fail to kill some tasks

Qian Zhang (JIRA) Mon, 30 Oct 2017 22:05:40 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226257#comment-16226257
 ]


Qian Zhang commented on MESOS-8051:
-----------------------------------

commit 05c7dd88f269692b7248c1087a3f57759eba6853
Author: Qian Zhang <zhq527...@gmail.com>
Date:   Mon Oct 9 09:01:15 2017 +0800

    Ignored the tasks already being killed when killing the task group.
    
    When the scheduler tries to kill multiple tasks in the task group
    simultaneously, the default executor will kill the tasks one by
    one. When the first task is killed, the default executor will kill
    all the other tasks in the task group. However, we need to ignore
    the tasks which are already being killed; otherwise, the check
    `CHECK(!container->killing);` in `DefaultExecutor::kill()` will fail.
    
    Review: https://reviews.apache.org/r/62836

commit 28831de34d098c894042246dd6fef402eb3b960d
Author: Qian Zhang <zhq527...@gmail.com>
Date:   Mon Oct 9 14:25:31 2017 +0800

    Added a test `DefaultExecutorTest.KillMultipleTasks`.
    
    Review: https://reviews.apache.org/r/62837

> Killing TASK_GROUP fail to kill some tasks
> ------------------------------------------
>
>                 Key: MESOS-8051
>                 URL: https://issues.apache.org/jira/browse/MESOS-8051
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.4.0
>            Reporter: A. Dukhovniy
>            Assignee: Qian Zhang
>            Priority: Critical
>         Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, screenshot-1.png
>
>
> When starting the following pod definition via Marathon:
> {code:java}
> {
>   "id": "/simple-pod",
>   "scaling": {
>     "kind": "fixed",
>     "instances": 3
>   },
>   "environment": {
>     "PING": "PONG"
>   },
>   "containers": [
>     {
>       "name": "ct1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "image": {
>         "kind": "MESOS",
>         "id": "busybox"
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "test-v1"
>         }
>       ]
>     },
>     {
>       "name": "ct2",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "etc"
>         },
>         {
>           "name": "v2",
>           "mountPath": "docker"
>         }
>       ]
>     }
>   ],
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [
>     {
>       "name": "v1"
>     },
>     {
>       "name": "v2",
>       "host": "/var/lib/docker"
>     }
>   ]
> }
> {code}
> Mesos successfully kills all {{ct2}} containers but fails to kill some or all
> of the {{ct1}} containers. I've attached both master and agent logs. The
> interesting part starts after Marathon issues 6 kills:
> {code:java}
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.209966  4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210033  4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210471  4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210518  4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210602  4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210639  4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210932  4753 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210968  4753 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211210  4747 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211251  4747 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211474  4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211514  4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> {code}
> All {{.ct1}} tasks eventually fail (after ~30s), whereas the {{.ct2}} tasks
> are killed successfully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
