[ https://issues.apache.org/jira/browse/MESOS-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chun-Hung Hsiao updated MESOS-7939:
-----------------------------------
       Shepherd:   (was: Benjamin Mahler)
    Description: 
Currently the default value for {{disk_watch_interval}} is 1 minute. This is not 
frequent enough and can lead to the following scenario:

1. The disk usage was checked and there was not enough headroom:
{noformat}
I0901 17:54:33.000000 25510 slave.cpp:5896] Current disk usage 99.87%. Max 
allowed age: 0ns
{noformat}
But no container was pruned because no container had been scheduled for GC (see 
the sketch after step 4 for how the max allowed age is derived from disk usage).

2. A task was completed. The task itself contained a lot of containers, each of 
which used a lot of disk space. Note that there is no way for the Mesos agent to 
schedule individual nested containers for GC, since nested containers are not 
necessarily tied to tasks. When the top-level container completed, it was 
scheduled for GC, and the nested containers would be GC'ed along with it:
{noformat}
I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling 
'/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
 for gc 1.99999466483852days in the future
I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling 
'/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
 for gc 1.99999466405037days in the future
I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling 
'/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc'
 for gc 1.9999946635763days in the future
I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling 
'/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e'
 for gc 1.99999466324148days in the future
{noformat}

3. Since the next disk usage check was still 40 seconds away, no GC was 
performed even though the disk was full. As a result, the Mesos agent failed to 
checkpoint the task status:
{noformat}
I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status 
update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task 
node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED: 
Failed to open 
'/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
 for status updates: No space left on device Failed to handle status update 
TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task 
node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
{noformat}

4. When the agent restarted, it tried to checkpoint the task status again. 
However, since the first disk usage check was scheduled 1 minute after startup, 
the agent failed before GC kicked in, falling into a restart failure loop:
{noformat}
F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED: 
Failed to open 
'/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
 for status updates: No space left on device Failed to handle status update 
TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task 
node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
{noformat}
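
The "max allowed age" reported in step 1 is the pruning threshold the agent 
computes from the current disk usage, and it shrinks to zero as the disk fills 
up. A minimal sketch of that relationship (illustrative only, not the actual 
slave.cpp code; the one-week delay and 10% headroom are assumed defaults):
{noformat}
// Illustrative sketch (not the actual slave.cpp code): how the pruning
// threshold ("max allowed age") collapses to zero as the disk fills up.
#include <algorithm>
#include <iostream>

int main() {
  // Assumed defaults: a GC delay of one week and ~10% of disk kept as headroom.
  const double gc_delay_hours = 24 * 7;
  const double disk_headroom = 0.10;

  for (double usage : {0.50, 0.85, 0.9987}) {
    // Sandboxes older than this are eligible for pruning at the next check.
    // At 99.87% usage (as in the log above) the threshold is already 0.
    double max_allowed_age_hours =
        gc_delay_hours * std::max(0.0, 1.0 - disk_headroom - usage);
    std::cout << "usage " << usage * 100 << "% -> max allowed age "
              << max_allowed_age_hours << " hours" << std::endl;
  }
  return 0;
}
{noformat}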

The disk usage check uses {{statvfs}}, which seems to be pretty cheap. Here is 
the time taken by 1 million {{statvfs}} calls:
{noformat}
real    0m1.685s
user    0m0.185s
sys    0m1.532s
{noformat}
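For reference, the numbers above can be reproduced with a micro-benchmark along 
the following lines (an assumed reproduction run under {{time}}, not the 
original measurement script):
{noformat}
// Assumed reproduction (not the original script): call statvfs() one million
// times on the agent work directory and let `time` report the cost.
#include <sys/statvfs.h>

#include <cstdio>

int main() {
  struct statvfs buf;

  for (int i = 0; i < 1000000; ++i) {
    if (statvfs("/var/lib/mesos/slave", &buf) != 0) {
      std::perror("statvfs");
      return 1;
    }
  }

  // The same quantity the periodic check logs: the used fraction of the disk.
  double usage = 1.0 - static_cast<double>(buf.f_bavail) / buf.f_blocks;
  std::printf("disk usage: %.2f%%\n", usage * 100);

  return 0;
}
{noformat}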
It seems that we should shorten the default value of {{disk_watch_interval}} to 
avoid the above scenario.
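For example, the interval can already be lowered per agent via the existing 
flag; the value below is only illustrative:
{noformat}
mesos-agent --work_dir=/var/lib/mesos/slave --disk_watch_interval=10secs
{noformat}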

Related ticket: MESOS-7031

  was:
A couple of completed tasks used up the disk space for sandboxes. After that, a 
task failed due to insufficient disk space when downloading its artifacts:
{noformat}
I0901 17:54:47.402745 31039 fetcher.cpp:222] Fetching URI 
'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz'
I0901 17:54:47.402756 31039 fetcher.cpp:165] Downloading resource from 
'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz' to 
'/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/containers/218b84a5-0301-4c95-89c5-c863145f86d8/jre-8u131-linux-x64.tar.gz'
E0901 17:54:47.796036 31039 fetcher.cpp:579] EXIT with status 1: Failed to 
fetch 'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz': Error 
downloading resource: Failed writing received data to disk/application
{noformat}
As a result, the container was destroyed:
{noformat}
I0901 17:54:48.000000 25508 containerizer.cpp:2599] Container 
602befac-3ff5-44d7-acac-aeebdc0e4666 has exited
I0901 17:54:48.000000 25508 containerizer.cpp:2158] Destroying container 
602befac-3ff5-44d7-acac-aeebdc0e4666 in RUNNING state
I0901 17:54:48.000000 25508 containerizer.cpp:2699] Transitioning the state of 
container 602befac-3ff5-44d7-acac-aeebdc0e4666 from RUNNING to DESTROYING
I0901 17:54:48.000000 25508 linux_launcher.cpp:505] Asked to destroy container 
602befac-3ff5-44d7-acac-aeebdc0e4666
I0901 17:54:48.000000 25508 linux_launcher.cpp:548] Using freezer to destroy 
cgroup mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
I0901 17:54:48.000000 25510 cgroups.cpp:3055] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos
I0901 17:54:48.000000 25508 cgroups.cpp:3055] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
I0901 17:54:48.000000 25510 cgroups.cpp:1413] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos after 
3.090176ms
I0901 17:54:48.000000 25508 cgroups.cpp:1413] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666 after 
5.518848ms
I0901 17:54:48.000000 25510 cgroups.cpp:3073] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos
I0901 17:54:48.000000 25512 cgroups.cpp:3073] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
I0901 17:54:48.000000 25512 cgroups.cpp:1442] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666 after 
2.351104ms
I0901 17:54:49.000000 25509 cgroups.cpp:1442] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos after 
106.335232ms
I0901 17:54:49.000000 25512 disk.cpp:320] Checking disk usage at 
'/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/container-path'
 for container 602befac-3ff5-44d7-acac-aeebdc0e4666 has been cancelled
I0901 17:54:49.000000 25511 disk.cpp:320] Checking disk usage at 
'/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666'
 for container 602befac-3ff5-44d7-acac-aeebdc0e4666 has been cancelled
I0901 17:54:49.000000 25511 container_assigner.cpp:101] Unregistering 
container_id[value: "602befac-3ff5-44d7-acac-aeebdc0e4666"].
{noformat}
However, due to insufficient disk space, the agent was unable to store the 
TASK_FAILED state, and this led to an agent crash:
{noformat}
I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status 
update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task 
node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED: 
Failed to open 
'/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
 for status updates: No space left on device Failed to handle status update 
TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task 
node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
{noformat}
When the agent restarted, it tried to destroy this orphan container but failed, 
then crashed again, causing a restart failure loop:
{noformat}
F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED: 
Failed to open 
'/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates'
 for status updates: No space left on device Failed to handle status update 
TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task 
node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 
9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
{noformat}
To prevent the above from happening, garbage collection should kick in early 
enough to clean up the container sandboxes of failed tasks.

Related ticket: MESOS-7031


> Disk usage check for garbage collection 
> ----------------------------------------
>
>                 Key: MESOS-7939
>                 URL: https://issues.apache.org/jira/browse/MESOS-7939
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Critical
>             Fix For: 1.4.1
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
