[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226257#comment-16226257 ]
Qian Zhang commented on MESOS-8051:
-----------------------------------

commit 05c7dd88f269692b7248c1087a3f57759eba6853
Author: Qian Zhang <zhq527...@gmail.com>
Date:   Mon Oct 9 09:01:15 2017 +0800

    Ignored the tasks already being killed when killing the task group.

    When the scheduler tries to kill multiple tasks in the task group
    simultaneously, the default executor will kill the tasks one by one.
    When the first task is killed, the default executor will kill all the
    other tasks in the task group, however, we need to ignore the tasks
    which are already being killed, otherwise, the check
    `CHECK(!container->killing);` in `DefaultExecutor::kill()` will fail.

    Review: https://reviews.apache.org/r/62836

commit 28831de34d098c894042246dd6fef402eb3b960d
Author: Qian Zhang <zhq527...@gmail.com>
Date:   Mon Oct 9 14:25:31 2017 +0800

    Added a test `DefaultExecutorTest.KillMultipleTasks`.

    Review: https://reviews.apache.org/r/62837

> Killing TASK_GROUP fail to kill some tasks
> ------------------------------------------
>
>                 Key: MESOS-8051
>                 URL: https://issues.apache.org/jira/browse/MESOS-8051
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.4.0
>            Reporter: A. Dukhovniy
>            Assignee: Qian Zhang
>            Priority: Critical
>         Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, screenshot-1.png
>
>
> When starting following pod definition via marathon:
> {code:java}
> {
>   "id": "/simple-pod",
>   "scaling": {
>     "kind": "fixed",
>     "instances": 3
>   },
>   "environment": {
>     "PING": "PONG"
>   },
>   "containers": [
>     {
>       "name": "ct1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "image": {
>         "kind": "MESOS",
>         "id": "busybox"
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "test-v1"
>         }
>       ]
>     },
>     {
>       "name": "ct2",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "etc"
>         },
>         {
>           "name": "v2",
>           "mountPath": "docker"
>         }
>       ]
>     }
>   ],
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [
>     {
>       "name": "v1"
>     },
>     {
>       "name": "v2",
>       "host": "/var/lib/docker"
>     }
>   ]
> }
> {code}
> mesos will successfully kill all {{ct2}} containers but fail to kill all/some of the {{ct1}} containers. I've attached both master and agent logs.
> The interesting part starts after marathon issues 6 kills:
> {code:java}
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210932 4753 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210968 4753 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211210 4747 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211251 4747 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211474 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211514 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> {code}
> All {{.ct1}} tasks fail eventually (~30s) where {{.ct2}} are successfully killed.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
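
For readers following the first commit message above, the guard it describes can be sketched roughly as follows. This is a toy model, not the Mesos source: `TaskGroupModel`, `killTask`, `taskTerminated`, `isKilling` and `killsIssued` are hypothetical names invented for illustration, a plain `assert` stands in for the quoted `CHECK(!container->killing);`, and the early return on a duplicate kill request is an assumption about how such requests are handled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the executor's per-task state.
struct Container {
  bool killing = false;
};

// Toy model of one task group. This is NOT the Mesos DefaultExecutor.
class TaskGroupModel {
public:
  explicit TaskGroupModel(const std::vector<std::string>& taskIds) {
    for (const std::string& id : taskIds) {
      containers_[id] = Container{};
    }
  }

  // A KILL call from the scheduler for a single task.
  void killTask(const std::string& id) {
    auto it = containers_.find(id);
    if (it == containers_.end() || it->second.killing) {
      return;  // Unknown task, or a kill is already in progress (assumed).
    }
    kill(it->second);
  }

  // A task has terminated, so the rest of its group is killed as well.
  void taskTerminated(const std::string& id) {
    containers_.erase(id);
    for (auto& entry : containers_) {
      if (entry.second.killing) {
        continue;  // The guard: skip tasks that are already being killed.
      }
      kill(entry.second);
    }
  }

  bool isKilling(const std::string& id) const {
    auto it = containers_.find(id);
    return it != containers_.end() && it->second.killing;
  }

  int killsIssued() const { return killsIssued_; }

private:
  // Mirrors the invariant quoted in the commit message: a container must
  // never be asked to start a second kill while one is in progress.
  void kill(Container& container) {
    assert(!container.killing);
    container.killing = true;
    ++killsIssued_;
  }

  std::map<std::string, Container> containers_;
  int killsIssued_ = 0;
};
```

In this model, a group of ct1, ct2 and ct3 that receives killTask("ct1"), killTask("ct2") and then taskTerminated("ct1") issues three kills in total. Removing the `continue` would trip the assertion on ct2, which is the failure mode the commit message describes.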