[ 
https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191502#comment-16191502
 ] 

A. Dukhovniy commented on MESOS-8051:
-------------------------------------

Here are the master logs for one of the failing tasks:

{code:java}
40268:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:25.210602  4748 master.cpp:5297] Processing 
KILL call for task 
'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
[email protected]:15101
40269:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:25.210639  4748 master.cpp:5371] Telling agent 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 
(10.0.1.207) to kill task 
simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
[email protected]:15101
40287:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:25.331063  4747 master.cpp:6841] Status update 
TASK_KILLING (UUID: 23c6e28b-4370-4da3-981c-13a121b145c0) for task 
simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 from agent 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207)
40288:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:25.331110  4747 master.cpp:6903] Forwarding 
status update TASK_KILLING (UUID: 23c6e28b-4370-4da3-981c-13a121b145c0) for 
task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001
40289:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:25.331193  4747 master.cpp:8928] Updating the 
state of task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of 
framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (latest state: 
TASK_KILLING, status update state: TASK_KILLING)
40297:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:25.341003  4750 master.cpp:5479] Processing 
ACKNOWLEDGE call 23c6e28b-4370-4da3-981c-13a121b145c0 for task 
simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
[email protected]:15101 on agent 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1
40337:Oct 04 14:58:35 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:35.229382  4746 master.cpp:5297] Processing 
KILL call for task 
'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
[email protected]:15101
40338:Oct 04 14:58:35 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:35.229418  4746 master.cpp:5371] Telling agent 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 
(10.0.1.207) to kill task 
simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
[email protected]:15101
40372:Oct 04 14:58:55 ip-10-0-5-229.eu-central-1.compute.internal 
mesos-master[4708]: I1004 14:58:55.168781  4752 master.cpp:6841] Status update 
TASK_FAILED (UUID: 57b5c03e-517c-4dc2-8592-c24e5c875fde) for task 
simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 from agent 
bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207)
{code}

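It takes ~30s from the first KILL call (14:58:25) until {{TASK_FAILED}} is 
received (14:58:55). Marathon issues 2 kills in the meantime, the second one 
at 14:58:35, and no terminal status update arrives until the failure.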
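
For anyone who wants to double-check the ~30s figure, here is a rough sketch that computes the elapsed time of each master-side event relative to the first KILL call. The timestamps are copied by hand from the log excerpt above; the {{EVENTS}} list and the {{seconds_since_first}} helper are just illustrative names, not part of Mesos or Marathon.

```python
from datetime import datetime

# Event name and glog timestamp (HH:MM:SS.ffffff), copied by hand from the
# master log excerpt for task ...3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1.
EVENTS = [
    ("KILL", "14:58:25.210602"),
    ("TASK_KILLING", "14:58:25.331063"),
    ("KILL", "14:58:35.229382"),
    ("TASK_FAILED", "14:58:55.168781"),
]


def seconds_since_first(events):
    """Return (event name, seconds elapsed since the first event) pairs."""
    parsed = [(name, datetime.strptime(ts, "%H:%M:%S.%f")) for name, ts in events]
    start = parsed[0][1]
    return [(name, (t - start).total_seconds()) for name, t in parsed]


timeline = seconds_since_first(EVENTS)
for name, elapsed in timeline:
    print(f"{elapsed:7.3f}s  {name}")
```

The last row is the time between the first KILL call and the {{TASK_FAILED}} update, as seen by the master.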

> Killing TASK_GROUP fails to kill some tasks
> -------------------------------------------
>
>                 Key: MESOS-8051
>                 URL: https://issues.apache.org/jira/browse/MESOS-8051
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.4.0
>            Reporter: A. Dukhovniy
>            Priority: Critical
>         Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, 
> screenshot-1.png
>
>
> When starting the following pod definition via Marathon:
> {code:java}
> {
>   "id": "/simple-pod",
>   "scaling": {
>     "kind": "fixed",
>     "instances": 3
>   },
>   "environment": {
>     "PING": "PONG"
>   },
>   "containers": [
>     {
>       "name": "ct1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "image": {
>         "kind": "MESOS",
>         "id": "busybox"
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "test-v1"
>         }
>       ]
>     },
>     {
>       "name": "ct2",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "etc"
>         },
>         {
>           "name": "v2",
>           "mountPath": "docker"
>         }
>       ]
>     }
>   ],
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [
>     {
>       "name": "v1"
>     },
>     {
>       "name": "v2",
>       "host": "/var/lib/docker"
>     }
>   ]
> }
> {code}
> Mesos will successfully kill all {{ct2}} containers but fail to kill some or 
> all of the {{ct1}} containers. I've attached both master and agent logs. The 
> interesting part starts after Marathon issues 6 kills (one per container for 
> each of the 3 instances):
> {code:java}
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.209966  4746 master.cpp:5297] Processing 
> KILL call for task 
> 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1' of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) 
> at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.210033  4746 master.cpp:5371] Telling 
> agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 
> (10.0.1.207) to kill task 
> simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
> [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.210471  4748 master.cpp:5297] Processing 
> KILL call for task 
> 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2' of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) 
> at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.210518  4748 master.cpp:5371] Telling 
> agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 
> (10.0.1.207) to kill task 
> simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
> [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.210602  4748 master.cpp:5297] Processing 
> KILL call for task 
> 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) 
> at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.210639  4748 master.cpp:5371] Telling 
> agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 
> (10.0.1.207) to kill task 
> simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
> [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.210932  4753 master.cpp:5297] Processing 
> KILL call for task 
> 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2' of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) 
> at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.210968  4753 master.cpp:5371] Telling 
> agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 
> (10.0.1.207) to kill task 
> simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2 of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
> [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.211210  4747 master.cpp:5297] Processing 
> KILL call for task 
> 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1' of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) 
> at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.211251  4747 master.cpp:5371] Telling 
> agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 
> (10.0.1.207) to kill task 
> simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1 of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
> [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.211474  4746 master.cpp:5297] Processing 
> KILL call for task 
> 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2' of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) 
> at [email protected]:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal 
> mesos-master[4708]: I1004 14:58:25.211514  4746 master.cpp:5371] Telling 
> agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 
> (10.0.1.207) to kill task 
> simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2 of framework 
> bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
> [email protected]:15101
> {code}
> All {{.ct1}} tasks ended up as {{TASK_FAILED}}, whereas the {{.ct2}} tasks were successfully killed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)