[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174971#comment-16174971
 ] 

Qian Zhang commented on MESOS-7963:
-----------------------------------

Can you let me know what the special case is? Currently, when the default 
executor gets a limitation, it kills all the other nested containers and then 
terminates itself, and I do not think we need to change this. Even without my 
proposal (i.e., raise the limitation only for the root container), all the 
nested containers will be killed as well (by the Mesos containerizer), so the 
result is the same. I am not sure when we would need to restart the nested 
container.

> Task groups can lose the container limitation status.
> -----------------------------------------------------
>
>                 Key: MESOS-7963
>                 URL: https://issues.apache.org/jira/browse/MESOS-7963
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, executor
>            Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container 
> limitation, that status update can be lost and only the executor failure will 
> be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json 
> --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
>     [
>         {
>             "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
>             "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             }, {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }, {
>                 "name": "disk",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 2
>                 }
>             }
>             ],
>             "command": {
>                 "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M 
> count=64 ; sleep 10000"
>             }
>         }
>     ]
> }'
> I0911 11:48:01.480689  7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868  7339 scheduler.cpp:470] New master detected at 
> [email protected]:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to 
> agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task 
> '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
> However, the agent logs show that this failed with a memory limitation:
> {noformat}
> I0911 11:48:02.235818  7012 http.cpp:532] Processing call 
> WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395  7013 status_update_manager.cpp:323] Received status 
> update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083  7016 slave.cpp:4875] Forwarding the update 
> TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to [email protected]:5050
> I0911 11:48:02.283661  7007 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455  7014 memory.cpp:516] OOM detected for container 
> 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445  7014 memory.cpp:556] Memory limit exceeded: Requested: 
> 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943  7012 containerizer.cpp:2681] Container 
> 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource 
> [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be 
> terminated
> {noformat}
> The following {{mesos-execute}} task will show the container limitation 
> correctly:
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json 
> --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
>     [
>         {
>             "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
>             "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             },
>             {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }],
>             "command": {
>                 "value": "sleep 600"
>             }
>         }, {
>             "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
>             "task_id": {"value" : "a7571608-3a53-4971-a187-41ed8be183ba"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             }, {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }, {
>                 "name": "disk",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 2
>                 }
>             }
>             ],
>             "command": {
>                 "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M 
> count=64 ; sleep 10000"
>             }
>         }
>     ]
> }'
> I0911 12:29:17.772161  7655 scheduler.cpp:184] Version: 1.5.0
> I0911 12:29:17.780640  7661 scheduler.cpp:470] New master detected at 
> [email protected]:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0011
> Submitted task group with tasks [ 1372b2e2-c501-4e80-bcbd-1a5c5194e206, 
> a7571608-3a53-4971-a187-41ed8be183ba ] to agent 
> 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task 
> '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 
> 'a7571608-3a53-4971-a187-41ed8be183ba'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> 'a7571608-3a53-4971-a187-41ed8be183ba'
>   message: 'Disk usage (65556KB) exceeds quota (34MB)'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LIMITATION_DISK
> {noformat}
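
To make the difference between the two runs quoted above concrete, here is a minimal sketch of how a framework might check whether a failure update carries a container limitation reason. It is illustrative only: plain dicts stand in for Mesos status updates, the key names and the helper limitation_reported are hypothetical, and only the state, source, message and reason values are copied from the output above.

```python
# Illustrative only: plain dicts stand in for Mesos status updates.
# The key names and this helper are assumptions, not a Mesos API.
LIMITATION_PREFIX = "REASON_CONTAINER_LIMITATION"


def limitation_reported(update):
    """Return True if a failed-task update names a container limitation."""
    reason = update.get("reason", "")
    return (update.get("state") == "TASK_FAILED"
            and reason.startswith(LIMITATION_PREFIX))


# Final update from the single-task run: no reason is attached,
# so the framework cannot tell that a limitation was hit.
single_task = {
    "task_id": "2866368d-7279-4657-b8eb-bf1d968e8ebf",
    "state": "TASK_FAILED",
    "source": "SOURCE_EXECUTOR",
    "message": "Command terminated with signal Killed",
}

# Final update for the dd task in the two-task run: the reason is present.
two_task = {
    "task_id": "a7571608-3a53-4971-a187-41ed8be183ba",
    "state": "TASK_FAILED",
    "source": "SOURCE_AGENT",
    "message": "Disk usage (65556KB) exceeds quota (34MB)",
    "reason": "REASON_CONTAINER_LIMITATION_DISK",
}

for update in (single_task, two_task):
    print(update["task_id"], limitation_reported(update))
```

A prefix match is used so that the same check would also cover other limitation reasons such as REASON_CONTAINER_LIMITATION_MEMORY.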



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
