[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172846#comment-16172846 ]

Qian Zhang commented on MESOS-7963:
-----------------------------------

[~jieyu] and [~jpe...@apache.org], I think what we should do is:
# In {{MesosContainerizerProcess::limited()}}, check whether the container is a 
root container; if it is, set the {{limitations}} not only on the root container 
but also on all of its nested containers (see the sketch after this list).
# In {{Http::waitNestedContainer()}}, propagate the reason and message of the 
nested container's termination to the default executor, so that the default 
executor can send the scheduler a status update carrying that reason and message 
for the nested container.
# For the {{network/ports}} isolator, we also want it to raise a limitation only 
for the root container rather than for any nested containers, just like 
{{cgroups/mem}} and {{disk/du}}.
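
To make the first point concrete, here is a minimal, self-contained sketch of 
the idea. This is not the actual containerizer code; the {{Container}} model, 
the {{limitations}} field, and the parent/child lookup are simplified 
assumptions purely for illustration:
{noformat}
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Simplified stand-ins for the real Mesos types (assumptions for
// illustration only).
struct ContainerLimitation {
  std::string reason;
  std::string message;
};

struct Container {
  std::vector<ContainerLimitation> limitations;
};

// Containers keyed by a dotted id, e.g. "root" and "root.task1", which
// approximates the parent/child relationship of nested ContainerIDs.
std::map<std::string, Container> containers_;

bool isNestedOf(const std::string& child, const std::string& parent) {
  return child.size() > parent.size() &&
         child.compare(0, parent.size() + 1, parent + ".") == 0;
}

// Sketch of the behaviour proposed in point 1: when a root container hits a
// limitation, record that limitation on the root container and on every
// nested container underneath it.
void limited(const std::string& containerId,
             const ContainerLimitation& limitation) {
  containers_[containerId].limitations.push_back(limitation);

  for (auto& [id, container] : containers_) {
    if (isNestedOf(id, containerId)) {
      container.limitations.push_back(limitation);
    }
  }
}

int main() {
  containers_["root"] = {};
  containers_["root.task1"] = {};
  containers_["root.task2"] = {};

  limited("root", {"REASON_CONTAINER_LIMITATION_MEMORY",
                   "Memory limit exceeded"});

  for (const auto& [id, container] : containers_) {
    std::cout << id << ": " << container.limitations.size()
              << " limitation(s)" << std::endl;
  }
  return 0;
}
{noformat}
Running this records the limitation on the root container and on both nested 
containers, which is the behaviour point 1 asks for.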

Please correct me if I missed anything. One comment though: for a task group 
with multiple tasks, if one of the tasks reaches the limit (e.g., it listens on 
an unallocated port), the scheduler will receive {{TASK_FAILED}} for all the 
tasks with exactly the same reason and message. I think this will confuse the 
scheduler, because the scheduler cannot figure out which particular task 
reached the limit.
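
For context, here is a rough sketch of how the reason and message from a nested 
container's termination (point 2 above) could flow into that task's status 
update. The type and function names here are illustrative stand-ins, not the 
actual default executor code:
{noformat}
#include <iostream>
#include <optional>
#include <string>

// Illustrative stand-ins for the termination and status protobufs.
struct ContainerTermination {
  std::optional<std::string> reason;   // e.g. "REASON_CONTAINER_LIMITATION_DISK"
  std::optional<std::string> message;  // e.g. "Disk usage (65556KB) exceeds quota (34MB)"
};

struct TaskStatus {
  std::string taskId;
  std::string state;
  std::string reason;
  std::string message;
};

// Sketch of point 2: when the WAIT_NESTED_CONTAINER response carries the
// limitation's reason and message, the default executor copies them into
// the TASK_FAILED update for that specific task, instead of the generic
// "Command terminated with signal Killed" message.
TaskStatus makeFailedUpdate(const std::string& taskId,
                            const ContainerTermination& termination) {
  TaskStatus status;
  status.taskId = taskId;
  status.state = "TASK_FAILED";
  status.reason = termination.reason.value_or("REASON_EXECUTOR_TERMINATED");
  status.message = termination.message.value_or(
      "Command terminated with signal Killed");
  return status;
}

int main() {
  ContainerTermination termination{
      "REASON_CONTAINER_LIMITATION_DISK",
      "Disk usage (65556KB) exceeds quota (34MB)"};

  TaskStatus update = makeFailedUpdate(
      "a7571608-3a53-4971-a187-41ed8be183ba", termination);

  std::cout << update.taskId << " -> " << update.state << " ("
            << update.reason << ": " << update.message << ")" << std::endl;
  return 0;
}
{noformat}
The ambiguity I describe above is exactly about which termination each task's 
update is built from when the limitation was raised on the root container.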

> Task groups can lose the container limitation status.
> -----------------------------------------------------
>
>                 Key: MESOS-7963
>                 URL: https://issues.apache.org/jira/browse/MESOS-7963
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, executor
>            Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container 
> limitation, that status update can be lost and only the executor failure will 
> be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json 
> --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
>     [
>         {
>             "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
>             "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             }, {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }, {
>                 "name": "disk",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 2
>                 }
>             }
>             ],
>             "command": {
>                 "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M 
> count=64 ; sleep 10000"
>             }
>         }
>     ]
> }'
> I0911 11:48:01.480689  7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868  7339 scheduler.cpp:470] New master detected at 
> master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to 
> agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task 
> '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
> However, the agent logs show that this failed with a memory limitation:
> {noformat}
> I0911 11:48:02.235818  7012 http.cpp:532] Processing call 
> WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395  7013 status_update_manager.cpp:323] Received status 
> update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083  7016 slave.cpp:4875] Forwarding the update 
> TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050
> I0911 11:48:02.283661  7007 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455  7014 memory.cpp:516] OOM detected for container 
> 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445  7014 memory.cpp:556] Memory limit exceeded: Requested: 
> 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943  7012 containerizer.cpp:2681] Container 
> 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource 
> [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be 
> terminated
> {noformat}
> The following {{mesos-execute}} task will show the container limitation 
> correctly:
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json 
> --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
>     [
>         {
>             "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
>             "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             },
>             {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }],
>             "command": {
>                 "value": "sleep 600"
>             }
>         }, {
>             "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
>             "task_id": {"value" : "a7571608-3a53-4971-a187-41ed8be183ba"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             }, {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }, {
>                 "name": "disk",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 2
>                 }
>             }
>             ],
>             "command": {
>                 "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M 
> count=64 ; sleep 10000"
>             }
>         }
>     ]
> }'
> I0911 12:29:17.772161  7655 scheduler.cpp:184] Version: 1.5.0
> I0911 12:29:17.780640  7661 scheduler.cpp:470] New master detected at 
> master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0011
> Submitted task group with tasks [ 1372b2e2-c501-4e80-bcbd-1a5c5194e206, 
> a7571608-3a53-4971-a187-41ed8be183ba ] to agent 
> 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task 
> '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 
> 'a7571608-3a53-4971-a187-41ed8be183ba'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> 'a7571608-3a53-4971-a187-41ed8be183ba'
>   message: 'Disk usage (65556KB) exceeds quota (34MB)'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LIMITATION_DISK
> {noformat}


