[
https://issues.apache.org/jira/browse/MESOS-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333088#comment-16333088
]
Gilbert Song commented on MESOS-7939:
-------------------------------------
I am downgrading this issue to `Major` priority, since we do not have a good
solution for now and the potential solution would still require operators to
manually restart the agent. In the meantime, operators can clean up the
sandboxes manually as a workaround.
> Early disk usage check for garbage collection during recovery
> -------------------------------------------------------------
>
> Key: MESOS-7939
> URL: https://issues.apache.org/jira/browse/MESOS-7939
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Reporter: Chun-Hung Hsiao
> Assignee: Chun-Hung Hsiao
> Priority: Major
>
> Currently the default value for `disk_watch_interval` is 1 minute. This is
> not frequent enough and can lead to the following scenario (a simplified
> sketch of the timing follows step 4):
> 1. The disk usage was checked and there was not enough headroom:
> {noformat}
> I0901 17:54:33.000000 25510 slave.cpp:5896] Current disk usage 99.87%. Max
> allowed age: 0ns
> {noformat}
> But no container was pruned because no container had been scheduled for GC.
> 2. A task completed. The task contained many nested containers, each of
> which used a lot of disk space. Note that there is no way for the Mesos agent
> to schedule individual nested containers for GC, since nested containers are
> not necessarily tied to tasks. When the top-level container completed, it was
> scheduled for GC, and its nested containers would be GC'ed along with it:
> {noformat}
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
> for gc 1.99999466483852days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
> for gc 1.99999466405037days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
> for gc 1.9999946635763days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
> for gc 1.99999466324148days in the future
> {noformat}
> 3. Since the next disk usage check was still about 40 seconds away, no GC
> was performed even though the disk was full. As a result, the Mesos agent
> failed to checkpoint the task status:
> {noformat}
> I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status
> update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED:
> Failed to open
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
> for status updates: No space left on device Failed to handle status update
> TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> 4. When the agent restarted, it tried to checkpoint the task status again.
> However, since the first disk usage check was scheduled 1 minute after
> startup, the agent failed before GC kicked in, falling into a restart failure
> loop:
> {noformat}
> F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED:
> Failed to open
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
> for status updates: No space left on device Failed to handle status update
> TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
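> The following is a minimal, self-contained sketch of the timing gap in steps
> 1-4. The names (`schedule`, `checkDiskUsage`) and the fixed 90% threshold are
> illustrative only and do not match the actual agent code in slave.cpp /
> gc.cpp; the point is that a sandbox scheduled for GC right after a disk check
> can sit on a full disk for up to `disk_watch_interval` before anything is
> pruned.
> {noformat}
> #include <iostream>
> #include <string>
> #include <vector>
>
> struct ScheduledPath {
>   std::string path;
>   int removalSecs;  // Simulated time at which normal GC would remove it.
> };
>
> std::vector<ScheduledPath> scheduled;
>
> // Stand-in for scheduling a sandbox for GC when a top-level container
> // completes (step 2); `gcDelaySecs` plays the role of `gc_delay`.
> void schedule(const std::string& path, int nowSecs, int gcDelaySecs) {
>   scheduled.push_back({path, nowSecs + gcDelaySecs});
>   std::cout << nowSecs << "s: scheduling '" << path << "' for gc in "
>             << gcDelaySecs << "s\n";
> }
>
> // Stand-in for the periodic disk usage check (step 1): when usage is above
> // an allowed threshold, prune whatever has already been scheduled.
> void checkDiskUsage(int nowSecs, double usage) {
>   std::cout << nowSecs << "s: current disk usage " << usage * 100.0 << "%\n";
>   if (usage > 0.90) {
>     for (const ScheduledPath& s : scheduled) {
>       std::cout << nowSecs << "s: pruning '" << s.path << "'\n";
>     }
>     scheduled.clear();
>   }
> }
>
> int main() {
>   // t=0s: the check runs while the disk is nearly full, but nothing has
>   // been scheduled for GC yet, so nothing can be pruned (step 1).
>   checkDiskUsage(0, 0.9987);
>
>   // t=11s: a top-level container completes and its sandbox is scheduled
>   // for GC (step 2), but the next check will not run for ~50 more seconds.
>   schedule("/var/lib/mesos/slave/slaves/<agent>/.../runs/<container>", 11,
>            2 * 24 * 60 * 60);
>
>   // t=16s: the agent hits ENOSPC while checkpointing a status update
>   // (step 3), well before the next check at t=60s could prune the sandbox.
>   std::cout << "16s: checkpoint fails with ENOSPC; next check at 60s is too"
>             << " late\n";
>
>   return 0;
> }
> {noformat}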
> We should trigger the disk usage check early during recovery so that GC can
> kick in and the agent can recover from this state.
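> Below is a rough sketch of the proposed ordering. The helper names are
> hypothetical and do not correspond to the actual functions in slave.cpp /
> gc.cpp; it only illustrates running the check before the agent starts
> writing new status-update checkpoints, rather than waiting for the first
> `disk_watch_interval` tick.
> {noformat}
> #include <iostream>
>
> // Hypothetical stand-ins for the agent's recovery and GC steps; the real
> // logic lives in slave.cpp / gc.cpp and has different signatures.
> void recoverCheckpointedState() { std::cout << "re-reading checkpointed state\n"; }
> void checkDiskUsageAndPrune()   { std::cout << "checking disk usage, pruning scheduled sandboxes\n"; }
> void continueRecovery()         { std::cout << "reregistering executors, handling status updates\n"; }
>
> // Proposed ordering: check disk usage as part of recovery, before any new
> // status-update checkpoints are written.
> void recover() {
>   recoverCheckpointedState();
>   checkDiskUsageAndPrune();  // early check so GC can free space first
>   continueRecovery();
> }
>
> int main() {
>   recover();
>   return 0;
> }
> {noformat}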
> Related ticket: MESOS-7031