[
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172089#comment-16172089
]
James Peach commented on MESOS-7963:
------------------------------------
As per discussion with [~jieyu], the right fix here is for the containerizer to
propagate the container limitation in the {{WAIT_NESTED_CONTAINER}} response
when it destroys a nested container. The executor should then pass that
information on in its status updates. There will still be a race between the
executor and the agent, but both parties will end up propagating the same
status update information.
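
As a rough illustration of that flow, here is a minimal C++ sketch. It uses
simplified stand-in types rather than the real {{WaitNestedContainer}} and
{{TaskStatus}} protobufs, and the {{limitation}} field and its shape are
assumptions for illustration only: the containerizer records the limitation
when it destroys the nested container, and the executor prefers that
reason/message over the generic signal message when building the terminal
status update.

{code:cpp}
#include <iostream>
#include <optional>
#include <string>

// Simplified stand-ins for the real protobuf messages; the field names here
// are illustrative assumptions, not the actual Mesos API.
struct ContainerLimitation {
  std::string reason;   // e.g. "REASON_CONTAINER_LIMITATION_MEMORY"
  std::string message;  // e.g. "Memory limit exceeded: ..."
};

// What the agent could return for WAIT_NESTED_CONTAINER once the
// containerizer records the limitation while destroying the container.
struct WaitNestedContainerResponse {
  int exit_status = 0;
  std::optional<ContainerLimitation> limitation;
};

struct TaskStatus {
  std::string state;
  std::string reason;
  std::string message;
};

// Executor side: build the terminal update from the wait response,
// preferring the propagated limitation over the generic signal message.
TaskStatus makeTerminalUpdate(const WaitNestedContainerResponse& wait) {
  TaskStatus status;
  status.state = "TASK_FAILED";

  if (wait.limitation) {
    // Forward the container limitation so the framework sees the
    // limitation reason rather than a bare SIGKILL.
    status.reason = wait.limitation->reason;
    status.message = wait.limitation->message;
  } else {
    // Fall back to what the executor reports today (see the logs below).
    status.message = "Command terminated with signal Killed";
  }

  return status;
}

int main() {
  // Simulate the OOM case from the agent logs below: the memory isolator
  // triggered a limitation and the containerizer killed the container.
  WaitNestedContainerResponse wait;
  wait.exit_status = 9;  // terminated by SIGKILL (OOM killer)
  wait.limitation = ContainerLimitation{
      "REASON_CONTAINER_LIMITATION_MEMORY",
      "Memory limit exceeded: Requested: 64MB Maximum Used: 64MB"};

  TaskStatus update = makeTerminalUpdate(wait);
  std::cout << update.state << ": " << update.message
            << " (" << update.reason << ")" << std::endl;

  return 0;
}
{code}

With something along these lines, the single-task case below would report
{{REASON_CONTAINER_LIMITATION_MEMORY}} instead of only 'Command terminated
with signal Killed' from the executor.
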
> Task groups can lose the container limitation status.
> -----------------------------------------------------
>
> Key: MESOS-7963
> URL: https://issues.apache.org/jira/browse/MESOS-7963
> Project: Mesos
> Issue Type: Bug
> Components: containerization, executor
> Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container
> limitation, that status update can be lost and only the executor failure will
> be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
>   "tasks": [
>     {
>       "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
>       "task_id": {"value": "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
>       "agent_id": {"value": ""},
>       "resources": [
>         {
>           "name": "cpus",
>           "type": "SCALAR",
>           "scalar": {"value": 0.2}
>         },
>         {
>           "name": "mem",
>           "type": "SCALAR",
>           "scalar": {"value": 32}
>         },
>         {
>           "name": "disk",
>           "type": "SCALAR",
>           "scalar": {"value": 2}
>         }
>       ],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
>       }
>     }
>   ]
> }'
> I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at [email protected]:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
> However, the agent logs show that this failed with a memory limitation:
> {noformat}
> I0911 11:48:02.235818 7012 http.cpp:532] Processing call WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to [email protected]:5050
> I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455 7014 memory.cpp:516] OOM detected for container 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445 7014 memory.cpp:556] Memory limit exceeded: Requested: 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943 7012 containerizer.cpp:2681] Container 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be terminated
> {noformat}
> The following {{mesos-execute}} task will show the container limitation
> correctly:
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
>   "tasks": [
>     {
>       "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
>       "task_id": {"value": "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
>       "agent_id": {"value": ""},
>       "resources": [
>         {
>           "name": "cpus",
>           "type": "SCALAR",
>           "scalar": {"value": 0.2}
>         },
>         {
>           "name": "mem",
>           "type": "SCALAR",
>           "scalar": {"value": 32}
>         }
>       ],
>       "command": {
>         "value": "sleep 600"
>       }
>     },
>     {
>       "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
>       "task_id": {"value": "a7571608-3a53-4971-a187-41ed8be183ba"},
>       "agent_id": {"value": ""},
>       "resources": [
>         {
>           "name": "cpus",
>           "type": "SCALAR",
>           "scalar": {"value": 0.2}
>         },
>         {
>           "name": "mem",
>           "type": "SCALAR",
>           "scalar": {"value": 32}
>         },
>         {
>           "name": "disk",
>           "type": "SCALAR",
>           "scalar": {"value": 2}
>         }
>       ],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
>       }
>     }
>   ]
> }'
> I0911 12:29:17.772161 7655 scheduler.cpp:184] Version: 1.5.0
> I0911 12:29:17.780640 7661 scheduler.cpp:470] New master detected at [email protected]:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0011
> Submitted task group with tasks [ 1372b2e2-c501-4e80-bcbd-1a5c5194e206, a7571608-3a53-4971-a187-41ed8be183ba ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 'a7571608-3a53-4971-a187-41ed8be183ba'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 'a7571608-3a53-4971-a187-41ed8be183ba'
>   message: 'Disk usage (65556KB) exceeds quota (34MB)'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LIMITATION_DISK
> {noformat}