[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171182#comment-16171182 ]
Qian Zhang commented on MESOS-7963:
-----------------------------------

[~jpe...@apache.org] In your second example, a task group with two tasks was launched. When the disk/du isolator raised a limitation for the root container, the Mesos containerizer tried to destroy the root container, but before doing that it destroyed the two nested containers first. So when the first nested container was destroyed, the default executor observed it (since the default executor was still alive at that moment) and sent a {{TASK_FAILED}} for the first task with source {{SOURCE_EXECUTOR}}. For the second task, I think the default executor itself was destroyed by the Mesos containerizer before it got a chance to send a status update, which is why we see a {{TASK_FAILED}} for the second task with source {{SOURCE_AGENT}}.

In your first example, the task group has only one task, so I think it follows the same flow as the first task in your second example, i.e., the default executor sent a {{TASK_FAILED}} for the task, and then the default executor itself was destroyed (or maybe self-terminated).

Currently both the cgroups isolator (memory subsystem) and the disk/du isolator raise the limitation for the root container rather than the nested container. I think we may need to change them to raise the limitation for the nested container, and enhance the implementation of {{waitNestedContainer()}} so that it propagates the reason and message of the container termination to the default executor; the default executor can then send a status update with that reason and message for the nested container to the scheduler. (A rough sketch of this idea follows the quoted issue description below.)

> Task groups can lose the container limitation status.
> -----------------------------------------------------
>
> Key: MESOS-7963
> URL: https://issues.apache.org/jira/browse/MESOS-7963
> Project: Mesos
> Issue Type: Bug
> Components: containerization, executor
> Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container
> limitation, that status update can be lost and only the executor failure will
> be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
>   "tasks":
>   [
>     {
>       "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
>       "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>         "name": "cpus",
>         "type": "SCALAR",
>         "scalar": { "value": 0.2 }
>       }, {
>         "name": "mem",
>         "type": "SCALAR",
>         "scalar": { "value": 32 }
>       }, {
>         "name": "disk",
>         "type": "SCALAR",
>         "scalar": { "value": 2 }
>       }],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
>       }
>     }
>   ]
> }'
> I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
>
> However, the agent logs show that this failed with a memory limitation:
>
> {noformat}
> I0911 11:48:02.235818 7012 http.cpp:532] Processing call WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050
> I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455 7014 memory.cpp:516] OOM detected for container 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445 7014 memory.cpp:556] Memory limit exceeded: Requested: 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943 7012 containerizer.cpp:2681] Container 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be terminated
> {noformat}
>
> The following {{mesos-execute}} task will show the container limitation correctly:
>
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
>   "tasks":
>   [
>     {
>       "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
>       "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>         "name": "cpus",
>         "type": "SCALAR",
>         "scalar": { "value": 0.2 }
>       }, {
>         "name": "mem",
>         "type": "SCALAR",
>         "scalar": { "value": 32 }
>       }],
>       "command": {
>         "value": "sleep 600"
>       }
>     }, {
>       "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
>       "task_id": {"value" : "a7571608-3a53-4971-a187-41ed8be183ba"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>         "name": "cpus",
>         "type": "SCALAR",
>         "scalar": { "value": 0.2 }
>       }, {
>         "name": "mem",
>         "type": "SCALAR",
>         "scalar": { "value": 32 }
>       }, {
>         "name": "disk",
>         "type": "SCALAR",
>         "scalar": { "value": 2 }
>       }],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
>       }
>     }
>   ]
> }'
> I0911 12:29:17.772161 7655 scheduler.cpp:184] Version: 1.5.0
> I0911 12:29:17.780640 7661 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0011
> Submitted task group with tasks [ 1372b2e2-c501-4e80-bcbd-1a5c5194e206, a7571608-3a53-4971-a187-41ed8be183ba ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 'a7571608-3a53-4971-a187-41ed8be183ba'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 'a7571608-3a53-4971-a187-41ed8be183ba'
>   message: 'Disk usage (65556KB) exceeds quota (34MB)'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LIMITATION_DISK
> {noformat}
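To make the last paragraph of my comment concrete, here is a minimal, self-contained C++ sketch of the proposed flow. The types below are simplified stand-ins, not the real Mesos protobufs or the actual {{waitNestedContainer()}} API; it only illustrates how the default executor could translate a propagated container limitation into the task's status update instead of the generic "Command terminated with signal Killed":

{noformat}
// Sketch only: simplified stand-ins for the Mesos types. The real
// definitions live in mesos.proto / agent.proto and carry more fields.
#include <iostream>
#include <optional>
#include <string>

enum class Reason { CONTAINER_LIMITATION_MEM, CONTAINER_LIMITATION_DISK };

struct ContainerTermination {
  // Would be populated by the containerizer when an isolator raises a
  // limitation for the (nested) container; empty otherwise.
  std::optional<Reason> reason;
  std::string message;   // e.g. "Disk usage (65556KB) exceeds quota (34MB)"
  int exit_status = 0;
};

struct TaskStatus {
  std::string state;     // "TASK_FAILED", ...
  std::string source;    // "SOURCE_EXECUTOR"
  std::optional<Reason> reason;
  std::string message;
};

// What the default executor could do once the WAIT_NESTED_CONTAINER
// response carries the limitation: forward its reason and message
// rather than reporting only the fatal signal.
TaskStatus statusFromTermination(const ContainerTermination& termination) {
  TaskStatus status;
  status.state = "TASK_FAILED";
  status.source = "SOURCE_EXECUTOR";

  if (termination.reason) {
    // Container limitation: propagate it so the scheduler sees e.g.
    // REASON_CONTAINER_LIMITATION_DISK with the isolator's message.
    status.reason = termination.reason;
    status.message = termination.message;
  } else {
    status.message =
        "Command exited with status " + std::to_string(termination.exit_status);
  }

  return status;
}

int main() {
  ContainerTermination termination;
  termination.reason = Reason::CONTAINER_LIMITATION_DISK;
  termination.message = "Disk usage (65556KB) exceeds quota (34MB)";

  TaskStatus status = statusFromTermination(termination);
  std::cout << status.state << " (" << status.source << "): "
            << status.message << std::endl;

  return 0;
}
{noformat}

With something along these lines, the framework in the second example above would see the disk limitation reason and message for the failed task even though the status update is sent by the executor rather than the agent.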