[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

James Peach (JIRA) Tue, 19 Sep 2017 01:48:13 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171327#comment-16171327
 ]


James Peach commented on MESOS-7963:
------------------------------------

{quote}
For the second task, I think before the default executor got a chance to send 
status update for it, the default executor itself was destroyed by Mesos 
containerizer, that's why we see a TASK_FAILED for the second task and its 
source is SOURCE_AGENT.
{quote}

Yes I think this would also happen if the executor was terminated before the 
status update from the failed task was acknowledged, though I'm not positive 
whether this can actually happen. When running a single task, you will 
sometimes get different status update results, implying that there is a race.

{quote}
Currently both cgroups isolator (memory subsystem) and disk/du isolator raise 
the limitation for root container rather than nested container.
{quote}

This is consistent since resources are always accumulated on the root 
container. I'm not sure that it is feasible to detect when a nested container 
triggers a limitation for these resources.

> Task groups can lose the container limitation status.
> -----------------------------------------------------
>
>                 Key: MESOS-7963
>                 URL: https://issues.apache.org/jira/browse/MESOS-7963
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, executor
>            Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container 
> limitation, that status update can be lost and only the executor failure will 
> be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json 
> --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
>     [
>         {
>             "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
>             "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             }, {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }, {
>                 "name": "disk",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 2
>                 }
>             }
>             ],
>             "command": {
>                 "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M 
> count=64 ; sleep 10000"
>             }
>         }
>     ]
> }'
> I0911 11:48:01.480689  7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868  7339 scheduler.cpp:470] New master detected at 
> [email protected]:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to 
> agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task 
> '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
> However, the agent logs show that this failed with a memory limitation:
> {noformat}
> I0911 11:48:02.235818  7012 http.cpp:532] Processing call 
> WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395  7013 status_update_manager.cpp:323] Received status 
> update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083  7016 slave.cpp:4875] Forwarding the update 
> TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to [email protected]:5050
> I0911 11:48:02.283661  7007 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455  7014 memory.cpp:516] OOM detected for container 
> 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445  7014 memory.cpp:556] Memory limit exceeded: Requested: 
> 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943  7012 containerizer.cpp:2681] Container 
> 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource 
> [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be 
> terminated
> {noformat}
> The following {{mesos-execute}} task will show the container limitation 
> correctly:
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json 
> --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
>     [
>         {
>             "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
>             "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             },
>             {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }],
>             "command": {
>                 "value": "sleep 600"
>             }
>         }, {
>             "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
>             "task_id": {"value" : "a7571608-3a53-4971-a187-41ed8be183ba"},
>             "agent_id": {"value" : ""},
>             "resources": [{
>                 "name": "cpus",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 0.2
>                 }
>             }, {
>                 "name": "mem",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 32
>                 }
>             }, {
>                 "name": "disk",
>                 "type": "SCALAR",
>                 "scalar": {
>                     "value": 2
>                 }
>             }
>             ],
>             "command": {
>                 "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M 
> count=64 ; sleep 10000"
>             }
>         }
>     ]
> }'
> I0911 12:29:17.772161  7655 scheduler.cpp:184] Version: 1.5.0
> I0911 12:29:17.780640  7661 scheduler.cpp:470] New master detected at 
> [email protected]:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0011
> Submitted task group with tasks [ 1372b2e2-c501-4e80-bcbd-1a5c5194e206, 
> a7571608-3a53-4971-a187-41ed8be183ba ] to agent 
> 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task 
> '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 
> 'a7571608-3a53-4971-a187-41ed8be183ba'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> 'a7571608-3a53-4971-a187-41ed8be183ba'
>   message: 'Disk usage (65556KB) exceeds quota (34MB)'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LIMITATION_DISK
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

Reply via email to