[
https://issues.apache.org/jira/browse/MESOS-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333088#comment-16333088
]
Gilbert Song commented on MESOS-7939:
-------------------------------------
I am downgrading this issue to `Major` priority, since we do not have a good
solution for now and the potential solution would still require operators to
manually restart the agent. In the meantime, operators can clean up the
sandboxes manually as a workaround.
> Early disk usage check for garbage collection during recovery
> -------------------------------------------------------------
>
> Key: MESOS-7939
> URL: https://issues.apache.org/jira/browse/MESOS-7939
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Reporter: Chun-Hung Hsiao
> Assignee: Chun-Hung Hsiao
> Priority: Major
>
> Currently the default value for `disk_watch_interval` is 1 minute. This is
> not frequent enough and can lead to the following scenario (a simplified
> sketch of the timing follows step 4):
> 1. The disk usage was checked and there was not enough headroom:
> {noformat}
> I0901 17:54:33.000000 25510 slave.cpp:5896] Current disk usage 99.87%. Max
> allowed age: 0ns
> {noformat}
> But no container was pruned because no container had been scheduled for GC.
> 2. A task completed. The task contained many nested containers, each of
> which used a lot of disk space. Note that there is no way for the Mesos agent
> to schedule individual nested containers for GC, since nested containers are
> not necessarily tied to tasks. When the top-level container completed, it was
> scheduled for GC, and its nested containers would be GC'ed along with it:
> {noformat}
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
> for gc 1.99999466483852days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
> for gc 1.99999466405037days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
> for gc 1.9999946635763days in the future
> I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
> for gc 1.99999466324148days in the future
> {noformat}
> 3. Since the next disk usage check was still about 40 seconds away, no GC
> was performed even though the disk was full. As a result, the Mesos agent
> failed to checkpoint the task status:
> {noformat}
> I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status
> update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED:
> Failed to open
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
> for status updates: No space left on device Failed to handle status update
> TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> 4. When the agent restarted, it tried to checkpoint the task status again.
> However, since the first disk usage check was scheduled 1 minute after
> startup, the agent failed before GC kicked in, falling into a restart failure
> loop:
> {noformat}
> F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED:
> Failed to open
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
> for status updates: No space left on device Failed to handle status update
> TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task
> node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework
> 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
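> The following is a minimal, self-contained sketch of the timing gap in steps
> 1-4. The names (`schedule`, `checkDiskUsage`) and the fixed 90% threshold are
> illustrative only and do not match the actual agent code in slave.cpp /
> gc.cpp; the point is that a sandbox scheduled for GC right after a disk check
> can sit on a full disk for up to `disk_watch_interval` before anything is
> pruned.
> {noformat}
> #include <iostream>
> #include <string>
> #include <vector>
>
> struct ScheduledPath {
>   std::string path;
>   int removalSecs;  // Simulated time at which normal GC would remove it.
> };
>
> std::vector<ScheduledPath> scheduled;
>
> // Stand-in for scheduling a sandbox for GC when a top-level container
> // completes (step 2); `gcDelaySecs` plays the role of `gc_delay`.
> void schedule(const std::string& path, int nowSecs, int gcDelaySecs) {
>   scheduled.push_back({path, nowSecs + gcDelaySecs});
>   std::cout << nowSecs << "s: scheduling '" << path << "' for gc in "
>             << gcDelaySecs << "s\n";
> }
>
> // Stand-in for the periodic disk usage check (step 1): when usage is above
> // an allowed threshold, prune whatever has already been scheduled.
> void checkDiskUsage(int nowSecs, double usage) {
>   std::cout << nowSecs << "s: current disk usage " << usage * 100.0 << "%\n";
>   if (usage > 0.90) {
>     for (const ScheduledPath& s : scheduled) {
>       std::cout << nowSecs << "s: pruning '" << s.path << "'\n";
>     }
>     scheduled.clear();
>   }
> }
>
> int main() {
>   // t=0s: the check runs while the disk is nearly full, but nothing has
>   // been scheduled for GC yet, so nothing can be pruned (step 1).
>   checkDiskUsage(0, 0.9987);
>
>   // t=11s: a top-level container completes and its sandbox is scheduled
>   // for GC (step 2), but the next check will not run for ~50 more seconds.
>   schedule("/var/lib/mesos/slave/slaves/<agent>/.../runs/<container>", 11,
>            2 * 24 * 60 * 60);
>
>   // t=16s: the agent hits ENOSPC while checkpointing a status update
>   // (step 3), well before the next check at t=60s could prune the sandbox.
>   std::cout << "16s: checkpoint fails with ENOSPC; next check at 60s is too"
>             << " late\n";
>
>   return 0;
> }
> {noformat}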
> We should trigger the disk usage check early during recovery so that GC can
> kick in and the agent can recover from this state.
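> Below is a rough sketch of the proposed ordering. The helper names are
> hypothetical and do not correspond to the actual functions in slave.cpp /
> gc.cpp; it only illustrates running the check before the agent starts
> writing new status-update checkpoints, rather than waiting for the first
> `disk_watch_interval` tick.
> {noformat}
> #include <iostream>
>
> // Hypothetical stand-ins for the agent's recovery and GC steps; the real
> // logic lives in slave.cpp / gc.cpp and has different signatures.
> void recoverCheckpointedState() { std::cout << "re-reading checkpointed state\n"; }
> void checkDiskUsageAndPrune()   { std::cout << "checking disk usage, pruning scheduled sandboxes\n"; }
> void continueRecovery()         { std::cout << "reregistering executors, handling status updates\n"; }
>
> // Proposed ordering: check disk usage as part of recovery, before any new
> // status-update checkpoints are written.
> void recover() {
>   recoverCheckpointedState();
>   checkDiskUsageAndPrune();  // early check so GC can free space first
>   continueRecovery();
> }
>
> int main() {
>   recover();
>   return 0;
> }
> {noformat}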
> Related ticket: MESOS-7031