[ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072739#comment-17072739 ]
Charles commented on MESOS-10107:
---------------------------------

Hey! Sure, I created [https://github.com/apache/mesos/pull/355] and answered your questions there (I'm not sure what the best place for discussion is, but it feels nice to have it close to the code review).

> containeriser: failed to remove cgroup - EBUSY
> ----------------------------------------------
>
>                 Key: MESOS-10107
>                 URL: https://issues.apache.org/jira/browse/MESOS-10107
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Charles
>            Priority: Major
>         Attachments: mesos-remove-cgroup-race.diff, reproduce-cgroup-rmdir-race.py
>
> We've been seeing some random errors on our cluster, where the container cgroup isn't properly destroyed after the OOM killer kicks in once the memory limit has been exceeded - see the analysis and patch below.
>
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS:
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device or resource busy
> {noformat}
>
> Initially I thought it was a race condition in the cgroup destruction code, but an strace confirmed that the cgroup directory was only deleted once all tasks had exited (edited and commented strace below, from a different instance of the same problem):
> {noformat}
> # get the list of processes
> 3431 23:01:32.293608 openat(AT_FDCWD, "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", O_RDONLY <unfinished ...>
> 3431 23:01:32.293669 <... openat resumed> ) = 18 <0.000036>
> 3431 23:01:32.294220 read(18, <unfinished ...>
> 3431 23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) = 15 <0.000033>
> 3431 23:01:32.294306 read(18, "", 4096) = 0 <0.000013>
> 3431 23:01:32.294346 close(18 <unfinished ...>
> 3431 23:01:32.294402 <... close resumed> ) = 0 <0.000045>
> # kill them
> 3431 23:01:32.296266 kill(5878, SIGKILL) = 0 <0.000019>
> 3431 23:01:32.296384 kill(6036, SIGKILL <unfinished ...>
> 3431 23:01:32.296474 kill(6210, SIGKILL) = 0 <0.000013>
> 6210 23:01:32.299159 +++ killed by SIGKILL +++
> 5878 23:01:32.327551 +++ killed by SIGKILL +++
> 3431 23:01:32.327601 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=5878, si_uid=1000, si_status=SIGKILL, si_utime=18, si_stime=4} ---
> 6036 23:01:32.327923 +++ killed by SIGKILL +++
> # wait for them
> 3429 23:01:32.338912 wait4(5878, <unfinished ...>
> 3429 23:01:32.339078 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], WNOHANG, NULL) = 5878 <0.000157>
> 3429 23:01:32.339944 wait4(6036, 0x7fcfb28ee5e0, WNOHANG, NULL) = -1 ECHILD (No child processes) <0.000005>
> 3429 23:01:32.340938 wait4(6210, <unfinished ...>
> 3429 23:01:32.341165 <... wait4 resumed> 0x7fcfb28ee5e0, WNOHANG, NULL) = -1 ECHILD (No child processes) <0.000218>
> # check that the cgroup is empty
> 3426 23:01:32.341490 openat(AT_FDCWD, "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", O_RDONLY <unfinished ...>
> 3426 23:01:32.342474 read(47, <unfinished ...>
> 3426 23:01:32.342842 <... read resumed> "", 8192) = 0 <0.000359>
> 3426 23:01:32.342940 close(47 <unfinished ...>
> 3426 23:01:32.344579 <... close resumed> ) = 0 <0.001631>
> # cgroup empty - attempt to delete
> 3428 23:01:32.345596 rmdir("/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb" <unfinished ...>
> 3428 23:01:32.345679 <... rmdir resumed> ) = -1 EBUSY (Device or resource busy) <0.000068>
> {noformat}
>
> So, as can be seen, the cgroup destruction code correctly waits until the tasks have exited before destroying the cgroup, and the cgroup is indeed empty, as shown by the read from cgroup.procs right before the rmdir (a rough sketch of this sequence is included below).
>
> It turns out that this is actually a kernel bug that was recently reported and fixed: https://lkml.org/lkml/2020/1/15/1349
>
> See the reproducer attached (also sketched below) - it repeatedly starts a task consuming several GB of memory, with a memory limit set slightly below that in order to trigger the OOM killer.
>
> It's a kernel bug, but since it affects virtually every kernel/Mesos combination currently in use, I think it would make sense to add a workaround - simply retrying to remove the cgroup after a timeout fixes it for me; see the patch attached (retry idea sketched below).
>
> [~qianzhang] [~abudnik]
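To help map the strace above to the logic, here is a rough sketch of the destroy sequence it shows (Python for brevity; the real code is the C++ in linux_launcher.cpp and cgroups.cpp, so treat this only as an illustration of the ordering). The point is that even this ordering - kill, reap, verify cgroup.procs is empty, then rmdir - can still hit EBUSY on affected kernels:

{code:python}
import os
import signal

def destroy(cgroup):
    procs_path = os.path.join(cgroup, "cgroup.procs")

    # Get the list of processes still in the cgroup.
    with open(procs_path) as f:
        pids = [int(line) for line in f if line.strip()]

    # Kill them.
    for pid in pids:
        os.kill(pid, signal.SIGKILL)

    # Wait for (reap) the ones that are our direct children.
    for pid in pids:
        try:
            os.waitpid(pid, 0)
        except ChildProcessError:
            pass  # already reaped, or not a direct child (ECHILD in the strace)

    # The cgroup now reads back empty...
    with open(procs_path) as f:
        assert f.read() == ""

    # ...yet on affected kernels this rmdir can still fail with EBUSY.
    os.rmdir(cgroup)
{code}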
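For reference, a minimal sketch of the kind of reproducer described above (this is a hypothetical illustration, not the attached reproduce-cgroup-rmdir-race.py). It assumes root and a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory; the cgroup name, limit, and retry interval are made up for the example:

{code:python}
#!/usr/bin/env python3
import errno
import os
import subprocess
import time

MEMORY_ROOT = "/sys/fs/cgroup/memory"   # assumed cgroup v1 mount point
LIMIT_BYTES = 256 * 1024 * 1024         # memory limit for the test cgroup

# The child sleeps briefly (so it can be moved into the cgroup first), then
# allocates twice the limit, which should get it killed by the OOM killer.
ALLOCATOR = "import time; time.sleep(0.5); x = bytearray({}); input()".format(2 * LIMIT_BYTES)

for attempt in range(1000):
    cgroup = os.path.join(MEMORY_ROOT, "rmdir-race-test")
    os.mkdir(cgroup)
    with open(os.path.join(cgroup, "memory.limit_in_bytes"), "w") as f:
        f.write(str(LIMIT_BYTES))

    # Start the memory hog and move it into the cgroup before it allocates.
    child = subprocess.Popen(["python3", "-c", ALLOCATOR])
    with open(os.path.join(cgroup, "cgroup.procs"), "w") as f:
        f.write(str(child.pid))

    child.wait()  # the OOM killer terminates it once it exceeds the limit

    # At this point cgroup.procs reads back empty, mirroring the strace above.
    with open(os.path.join(cgroup, "cgroup.procs")) as f:
        assert f.read() == ""

    try:
        os.rmdir(cgroup)
    except OSError as e:
        if e.errno == errno.EBUSY:
            print("reproduced EBUSY on attempt", attempt)
        # Keep retrying so the next iteration starts from a clean state.
        while os.path.exists(cgroup):
            time.sleep(0.1)
            try:
                os.rmdir(cgroup)
            except OSError:
                pass
{code}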
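And a sketch of the retry idea behind the proposed workaround (the attached patch is against Mesos' C++ cgroup code; the helper name and retry parameters here are invented for illustration):

{code:python}
import errno
import os
import time

def remove_cgroup_with_retries(path, retries=10, interval=0.1):
    """Retry rmdir on EBUSY, giving the kernel time to release the empty cgroup."""
    for attempt in range(retries):
        try:
            os.rmdir(path)
            return
        except OSError as e:
            if e.errno != errno.EBUSY or attempt == retries - 1:
                raise
            time.sleep(interval)
{code}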