[ 
https://issues.apache.org/jira/browse/MESOS-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424606#comment-16424606
 ] 

Chun-Hung Hsiao commented on MESOS-7939:
----------------------------------------

Notes for this problem:

All of these sandboxes belong to nested containers, and we don't actively GC 
nested containers while the top-level container (the executor) is still 
running. If the executor doesn't actively remove the container sandboxes 
(through the API), and we don't enforce disk usage for the whole executor, 
then the disk will eventually become full.

The problem is that we currently lack a mechanism to inform the executor that 
it is running out of disk so it can do proper cleanup, instead of killing the 
executor when it uses up its disk.

All improvements on GC (setting up a larger GC headroom, setting up a shorter 
disk watch interval, or kicking in GC early as proposed by this ticket) are 
only useful after the agent fails. None of these approaches will prevent the 
disk from being used up in the first place, because sandboxes for completed 
nested containers are considered "active" unless the executor explicitly 
removes them.
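
To illustrate the point about headroom, here is a rough Python sketch. It assumes the formula given in the agent's `--gc_disk_headroom` flag documentation, `gc_delay * max(0.0, 1.0 - gc_disk_headroom - disk usage)`, with what I believe are the default values (`gc_delay` of 1 week, `gc_disk_headroom` of 0.1). The function names and the `scheduled` dictionary are hypothetical and only model the behavior; this is not the agent's actual code.

```python
from datetime import timedelta


def max_allowed_age(usage, gc_delay=timedelta(weeks=1), gc_disk_headroom=0.1):
    """Maximum age of a GC-scheduled path for a disk usage fraction in [0, 1].

    Mirrors the documented formula; the defaults here are assumptions.
    """
    return gc_delay * max(0.0, 1.0 - gc_disk_headroom - usage)


def prunable(scheduled, max_age):
    """Paths whose time since being scheduled for GC has reached max_age.

    `scheduled` maps path -> time elapsed since it was scheduled for GC.
    Sandboxes that were never scheduled (e.g. nested container sandboxes
    under a running executor) do not appear here, so they are never pruned.
    """
    return [path for path, age in scheduled.items() if age >= max_age]


if __name__ == "__main__":
    age = max_allowed_age(0.9987)   # usage from the "Current disk usage" log line
    print(age)                      # a zero timedelta, matching "Max allowed age: 0ns"
    print(prunable({}, age))        # nothing scheduled, so nothing to prune
```

Even with a max allowed age of zero, the prune step only sees paths that were scheduled for GC, which is why tuning the headroom alone cannot reclaim sandboxes that are still considered active.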

> Early disk usage check for garbage collection during recovery
> -------------------------------------------------------------
>
>                 Key: MESOS-7939
>                 URL: https://issues.apache.org/jira/browse/MESOS-7939
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Major
>
> Currently the default value for `disk_watch_interval` is 1 minute. This is 
> not fast enough and could lead to the following scenario:
> 1. The disk usage was checked and there was not enough headroom:
> {noformat}
> I0901 17:54:33.000000 25510 slave.cpp:5896] Current disk usage 99.87%. Max 
> allowed age: 0ns
> {noformat}
> But no container was pruned because no container had been scheduled for GC.
> 2. A task was completed. The task itself contained a lot of nested 
> containers, each of which used a lot of disk space. Note that there is no 
> way for the Mesos agent to schedule individual nested containers for GC 
> since nested containers are not necessarily tied to tasks. When the 
> top-level container completed, it was scheduled for GC, and the nested 
> containers would be GC'ed along with it:
> {noformat}
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling 
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
>  for gc 1.99999466483852days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling 
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
>  for gc 1.99999466405037days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
>  for gc 1.9999946635763days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
>  for gc 1.99999466324148days in the future
> {noformat}
> 3. Since the next disk usage check was still about 40 seconds away, no GC 
> was performed even though the disk was full. As a result, the Mesos agent 
> failed to checkpoint the task status:
> {noformat}
> I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status 
> update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task 
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED: 
> Failed to open 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
>  for status updates: No space left on device Failed to handle status update 
> TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task 
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> 4. When the agent restarted, it tried to checkpoint the task status again. 
> However, since the first disk usage check was scheduled 1 minute after 
> startup, the agent failed before GC kicked in, falling into a restart failure 
> loop:
> {noformat}
> F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED: 
> Failed to open 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
>  for status updates: No space left on device Failed to handle status update 
> TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task 
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> We should kick in GC early, so the agent can recover from this state.
> Related ticket: MESOS-7031
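
The timing in the scenario above can be checked with a small Python sketch. It uses only the timestamps from the quoted logs and assumes, as a simplification, that disk usage checks fire exactly every `disk_watch_interval` after the logged check at 17:54:33. The helper names `t` and `first_check_after` are hypothetical, and the shorter intervals are illustrative values, not recommendations.

```python
from datetime import datetime, timedelta


def t(hms):
    """Parse a time of day from the logs; the date part is a placeholder."""
    return datetime.strptime(hms, "%H:%M:%S")


def first_check_after(last_check, interval, event):
    """Time of the first periodic check that falls strictly after `event`."""
    periods = (event - last_check) // interval + 1
    return last_check + periods * interval


last_check = t("17:54:33")   # "Current disk usage 99.87%" log line
scheduled = t("17:54:44")    # sandboxes scheduled for GC
failure = t("17:54:49")      # checkpoint fails, "No space left on device"

if __name__ == "__main__":
    for seconds in (60, 10, 5):
        check = first_check_after(last_check, timedelta(seconds=seconds), scheduled)
        in_time = check < failure
        print(f"interval={seconds}s next check={check.time()} before failure: {in_time}")
```

Under these assumptions, the 1 minute default puts the next check at 17:55:33, 44 seconds after the failed checkpoint. Only an interval short enough to place a check inside the 5 second window between scheduling and the failure would have let GC run first, which is why shortening the interval narrows the race but does not remove it.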



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)