[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173413#comment-16173413 ]
James Peach commented on MESOS-7963:
------------------------------------

I think that steps 1 - 3 sound fine and will fix the immediate problem.

{quote}
if one of the tasks reaches the limit ... the scheduler will receive TASK_FAILED for all the tasks with the exactly same reason and message
{quote}

This can happen today, though the raciness generally hides it. In most cases, I think this behavior is correct and the scheduler ought to deal with it. For example, one container can use most of the memory while a smaller container triggers the OOM (and similarly for disk). In such cases, it is hard to make a principled decision about which container is responsible.

{quote}
least for {{network/ports}} isolator, we should make it raise the limitation for nested container since it has the ability to tell which nested container reaches the limit
{quote}

We could do this for the {{network/ports}} isolator, but I would generally prefer resource limitations to have a single well-defined behavior. If we let this isolator be special, the flexibility it allows is inconsistent with the other isolators, and it becomes an extra special case that executors have to deal with. At least as a starting point, I'd like the behavior to be consistent and well-defined.
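To make the consistent handling concrete, here is a minimal scheduler-side sketch, assuming the v1 C++ protobufs from {{mesos/v1/mesos.hpp}}; {{isContainerLimitation}} and {{handleUpdate}} are hypothetical helper names, not Mesos API. The idea is that a limitation reason can arrive on every task in the group, so the scheduler reacts to the group as a whole rather than trying to single out the responsible task:

{noformat}
#include <iostream>

#include <mesos/v1/mesos.hpp>

using mesos::v1::TaskStatus;

// Hypothetical helper: true when a terminal update was caused by a container
// resource limitation (generic, disk, or memory).
bool isContainerLimitation(const TaskStatus& status)
{
  if (status.state() != mesos::v1::TASK_FAILED || !status.has_reason()) {
    return false;
  }

  switch (status.reason()) {
    case TaskStatus::REASON_CONTAINER_LIMITATION:
    case TaskStatus::REASON_CONTAINER_LIMITATION_DISK:
    case TaskStatus::REASON_CONTAINER_LIMITATION_MEMORY:
      return true;
    default:
      return false;
  }
}

// Hypothetical handler: with a single well-defined behavior, every task in
// the group may carry the same limitation reason and message, so react once
// per group (e.g. relaunch it with more resources) rather than guessing
// which task was responsible.
void handleUpdate(const TaskStatus& status)
{
  if (isContainerLimitation(status)) {
    std::cout << "Task " << status.task_id().value()
              << " hit a container limitation: " << status.message()
              << std::endl;
  }
}
{noformat}

Keying off {{TaskStatus::reason()}} rather than matching the message string keeps a scheduler robust to changes in message wording.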
> Task groups can lose the container limitation status.
> -----------------------------------------------------
>
>                 Key: MESOS-7963
>                 URL: https://issues.apache.org/jira/browse/MESOS-7963
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, executor
>            Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container limitation, that status update can be lost and only the executor failure will be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
>   "tasks":
>   [
>     {
>       "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
>       "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>         "name": "cpus",
>         "type": "SCALAR",
>         "scalar": {
>           "value": 0.2
>         }
>       }, {
>         "name": "mem",
>         "type": "SCALAR",
>         "scalar": {
>           "value": 32
>         }
>       }, {
>         "name": "disk",
>         "type": "SCALAR",
>         "scalar": {
>           "value": 2
>         }
>       }],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
>       }
>     }
>   ]
> }'
> I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
> However, the agent logs show that this failed with a memory limitation:
> {noformat}
> I0911 11:48:02.235818 7012 http.cpp:532] Processing call WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050
> I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455 7014 memory.cpp:516] OOM detected for container 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445 7014 memory.cpp:556] Memory limit exceeded: Requested: 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943 7012 containerizer.cpp:2681] Container 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be terminated
> {noformat}
> The following {{mesos-execute}} task will show the container limitation correctly:
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
>   "tasks":
>   [
>     {
>       "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
>       "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>         "name": "cpus",
>         "type": "SCALAR",
>         "scalar": {
>           "value": 0.2
>         }
>       }, {
>         "name": "mem",
>         "type": "SCALAR",
>         "scalar": {
>           "value": 32
>         }
>       }],
>       "command": {
>         "value": "sleep 600"
>       }
>     }, {
>       "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
>       "task_id": {"value" : "a7571608-3a53-4971-a187-41ed8be183ba"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>         "name": "cpus",
>         "type": "SCALAR",
>         "scalar": {
>           "value": 0.2
>         }
>       }, {
>         "name": "mem",
>         "type": "SCALAR",
>         "scalar": {
>           "value": 32
>         }
>       }, {
>         "name": "disk",
>         "type": "SCALAR",
>         "scalar": {
>           "value": 2
>         }
>       }],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
>       }
>     }
>   ]
> }'
> I0911 12:29:17.772161 7655 scheduler.cpp:184] Version: 1.5.0
> I0911 12:29:17.780640 7661 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0011
> Submitted task group with tasks [ 1372b2e2-c501-4e80-bcbd-1a5c5194e206, a7571608-3a53-4971-a187-41ed8be183ba ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 'a7571608-3a53-4971-a187-41ed8be183ba'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 'a7571608-3a53-4971-a187-41ed8be183ba'
>   message: 'Disk usage (65556KB) exceeds quota (34MB)'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LIMITATION_DISK
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)