[ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084136#comment-17084136 ]

Andrei Budnik commented on MESOS-10107:
---------------------------------------

{code:java}
commit af3ca189aced5fbc537bfca571264142d4cd37b3
Author: Charles-Francois Natali <cf.nat...@gmail.com>
Date:   Wed Apr 1 13:40:16 2020 +0100

    Handled EBUSY when destroying a cgroup.

    It's a workaround for kernel bugs which can cause `rmdir` to fail with
    `EBUSY` even though the cgroup - appears - empty.
    See for example https://lkml.org/lkml/2020/1/15/1349

    This closes #355
{code}
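
For illustration, a minimal sketch of the kind of retry-on-EBUSY workaround the
commit describes (this is not the actual Mesos patch; the function name, retry
count and delay below are made up):

{code:cpp}
// Hypothetical sketch only - not the code from the commit above. Retries
// rmdir() a few times when it fails with EBUSY, since an affected kernel may
// briefly report the cgroup as busy even though cgroup.procs is already empty.
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>

#include <unistd.h>

bool removeCgroupWithRetry(
    const std::string& path,
    int attempts = 10,
    std::chrono::milliseconds delay = std::chrono::milliseconds(100))
{
  for (int i = 0; i < attempts; i++) {
    if (::rmdir(path.c_str()) == 0) {
      return true;   // Cgroup removed.
    }
    if (errno != EBUSY) {
      return false;  // A different error; retrying won't help.
    }
    std::this_thread::sleep_for(delay);  // Transient EBUSY: back off and retry.
  }
  return false;      // Still EBUSY after all attempts.
}
{code}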

> containeriser: failed to remove cgroup - EBUSY
> ----------------------------------------------
>
>                 Key: MESOS-10107
>                 URL: https://issues.apache.org/jira/browse/MESOS-10107
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Charles N
>            Priority: Major
>              Labels: cgroups, containerization
>             Fix For: 1.10.0
>
>         Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container 
> cgroup isn't properly destroyed after the OOM killer kicks in when the memory 
> limit is exceeded - see the analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 
> 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device 
> or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code, 
> but an strace confirmed that removal of the cgroup directory is only attempted 
> once all tasks have exited (edited and annotated strace below, from a 
> different instance of the same problem):
> {noformat}
> # get the list of processes
> 3431  23:01:32.293608 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY <unfinished ...>
> 3431  23:01:32.293669 <... openat resumed> ) = 18 <0.000036>
> 3431  23:01:32.294220 read(18,  <unfinished ...>
> 3431  23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) =
> 15 <0.000033>
> 3431  23:01:32.294306 read(18, "", 4096) = 0 <0.000013>
> 3431  23:01:32.294346 close(18 <unfinished ...>
> 3431  23:01:32.294402 <... close resumed> ) = 0 <0.000045>
> #kill them
> 3431  23:01:32.296266 kill(5878, SIGKILL) = 0 <0.000019>
> 3431  23:01:32.296384 kill(6036, SIGKILL <unfinished ...>
> 3431  23:01:32.296474 kill(6210, SIGKILL) = 0 <0.000013>
> 6210  23:01:32.299159 +++ killed by SIGKILL +++
> 5878  23:01:32.327551 +++ killed by SIGKILL +++
> 3431  23:01:32.327601 --- SIGCHLD {si_signo=SIGCHLD,
> si_code=CLD_KILLED, si_pid=5878, si_uid=1000, si_status=SIGKILL,
> si_utime=18, si_stime=4} ---
> 6036  23:01:32.327923 +++ killed by SIGKILL +++
> # wait for them
> 3429  23:01:32.338912 wait4(5878,  <unfinished ...>
> 3429  23:01:32.339078 <... wait4 resumed> [{WIFSIGNALED(s) &&
> WTERMSIG(s) == SIGKILL}], WNOHANG, NULL) = 5878 <0.000157>
> 3429  23:01:32.339944 wait4(6036, 0x7fcfb28ee5e0, WNOHANG, NULL) = -1
> ECHILD (No child processes) <0.000005>
> 3429  23:01:32.340938 wait4(6210,  <unfinished ...>
> 3429  23:01:32.341165 <... wait4 resumed> 0x7fcfb28ee5e0, WNOHANG,
> NULL) = -1 ECHILD (No child processes) <0.000218>
> # check cgroup empty
> 3426  23:01:32.341490 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY <unfinished ...>
> 3426  23:01:32.342474 read(47,  <unfinished ...>
> 3426  23:01:32.342842 <... read resumed> "", 8192) = 0 <0.000359>
> 3426  23:01:32.342940 close(47 <unfinished ...>
> 3426  23:01:32.344579 <... close resumed> ) = 0 <0.001631>
> # cgroup empty - attempt to delete
> 3428  23:01:32.345596
> rmdir("/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb"
> <unfinished ...>
> 3428  23:01:32.345679 <... rmdir resumed> ) = -1 EBUSY (Device or resource 
> busy) <0.000068>
> {noformat}
> So, as can be seen, the cgroup destruction code correctly waits until the 
> tasks have exited before destroying the cgroup, and the cgroup is indeed 
> empty: the read from cgroup.procs right before the rmdir returns nothing.
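> For illustration only, a heavily simplified C++ rendering of that sequence 
> (not Mesos's actual cgroups destruction code; the helper name is made up), 
> with the point of failure marked:
> {code:cpp}
> // Simplified illustration of the kill -> reap -> verify-empty -> rmdir
> // sequence visible in the strace above; not the real Mesos implementation.
> #include <fstream>
> #include <string>
> #include <vector>
>
> #include <signal.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> bool destroyCgroup(const std::string& cgroup)
> {
>   // 1. Read the member pids from cgroup.procs.
>   std::vector<pid_t> pids;
>   std::ifstream procs(cgroup + "/cgroup.procs");
>   pid_t pid;
>   while (procs >> pid) {
>     pids.push_back(pid);
>   }
>
>   // 2. Kill and reap them.
>   for (pid_t p : pids) {
>     ::kill(p, SIGKILL);
>   }
>   for (pid_t p : pids) {
>     ::waitpid(p, nullptr, 0);
>   }
>
>   // 3. cgroup.procs now reads back empty (as in the strace above)...
>   std::ifstream check(cgroup + "/cgroup.procs");
>   bool empty = !(check >> pid);
>
>   // 4. ...yet rmdir() can still fail with EBUSY on affected kernels.
>   return empty && ::rmdir(cgroup.c_str()) == 0;
> }
> {code}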
> It turns out that it's actually a kernel bug recently reported and fixed: 
> https://lkml.org/lkml/2020/1/15/1349
> See the reproducer attached - it repeatedly starts a task consuming several 
> GB of memory, with a memory limit set slightly below that, to trigger the OOM 
> killer.
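> Roughly, the reproduction idea looks like the following C++ sketch (the 
> attached reproduce-cgroup-rmdir-race.py is the actual reproducer; the cgroup 
> names, the 1 GB limit and the 2 GB allocation are illustrative, and it 
> assumes a cgroup-v1 host, root privileges and an affected kernel):
> {code:cpp}
> // Illustrative sketch of the reproduction idea, not the attached script.
> #include <cerrno>
> #include <cstdio>
> #include <cstdlib>
> #include <fstream>
> #include <string>
>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> static void writeFile(const std::string& path, const std::string& value)
> {
>   std::ofstream(path) << value << std::flush;
> }
>
> int main()
> {
>   const std::string memcg = "/sys/fs/cgroup/memory/ebusy-repro";
>   const std::string frzcg = "/sys/fs/cgroup/freezer/ebusy-repro";
>
>   for (int i = 0; ; i++) {
>     ::mkdir(memcg.c_str(), 0755);
>     ::mkdir(frzcg.c_str(), 0755);
>
>     // Cap the task at 1 GB of memory.
>     writeFile(memcg + "/memory.limit_in_bytes", "1073741824");
>
>     pid_t pid = ::fork();
>     if (pid == 0) {
>       // Child: enter both cgroups, then fault in ~2 GB so the OOM killer
>       // terminates it.
>       writeFile(memcg + "/cgroup.procs", std::to_string(::getpid()));
>       writeFile(frzcg + "/cgroup.procs", std::to_string(::getpid()));
>
>       const size_t size = 2UL << 30;
>       char* hog = static_cast<char*>(std::malloc(size));
>       for (size_t off = 0; hog != nullptr && off < size; off += 4096) {
>         hog[off] = 'x';  // Touch every page so the limit is really exceeded.
>       }
>       ::_exit(0);  // Normally unreachable: the OOM killer fires first.
>     }
>
>     // Parent: reap the OOM-killed child. The freezer cgroup is now empty...
>     ::waitpid(pid, nullptr, 0);
>
>     // ...yet on affected kernels rmdir() occasionally returns EBUSY.
>     if (::rmdir(frzcg.c_str()) == -1 && errno == EBUSY) {
>       std::printf("iteration %d: rmdir failed with EBUSY\n", i);
>       ::usleep(100 * 1000);     // Retrying after a short delay
>       ::rmdir(frzcg.c_str());   // typically succeeds.
>     }
>     ::rmdir(memcg.c_str());
>   }
> }
> {code}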
> It's a kernel bug, but since it affects virtually every kernel and Mesos 
> version currently in use, I think it would make sense to add a workaround - 
> simply retrying the cgroup removal after a short delay fixes it for me; see 
> the patch attached.
> [~qianzhang] [~abudnik]


