[GitHub] [mesos] cf-natali commented on pull request #388: Fixed a bug where the cgroup task killer leaves the cgroup frozen.

GitBox Sat, 29 May 2021 12:48:03 -0700


cf-natali commented on pull request #388:
URL: https://github.com/apache/mesos/pull/388#issuecomment-850889402



   > That said, I'm not convinced that postponing c) is the best available 
option.
   > 
   > Maybe we should consider replacing c) with another termination mechanism 
that will be **guaranteed** not to halt this loop in the middle of an 
iteration? Something like ~setting a flag by that discard callback and checking 
this flag~ checking `future.hasDiscard()` inside this loop before calling 
`cgroups::freezer:freeze()` ?
   
   Yes, I wondered about this, and actually I initially wanted to go for this 
approach to never interrupt of the middle of an iteration, however I chose the 
approach in this MR because:
   - it's simpler
   - I don't feel too comfortable changing this code as it's already the result 
of several workarounds, so as you said earlier it's not clear that any change 
won't break some subtle corner-case
   - but more fundamentally, the reason is that if the freezing fails, then I 
think pretty much all bets are off. For example in case of timeout upon freeze, 
I'm not convinced that the call to `TaskKiller::kill` is safe because by 
definition some tasks are not frozen hence the kill won't be atomic and it 
might be possible that because of PID reuse we end up killing the wrong 
process. Another thing is that if freezing failed, I am not sure how reliable 
the thawing or if the kernel might just discard it which will also leave the 
cgroup in a bad state.
   
   However I'm open to trying it if you prefer the non-interruption approach.
   
   It's so sad that there is no simple and reliable way to do this. AFAICT 
systemd doesn't use the freezer cgroup because of its various issues - it 
actually suffers from PID reuse race conditions, looks like they are 
considering using freezer in cgroupsv2: 
https://github.com/systemd/systemd/issues/13101
   
   And sadly, while PID namespace mentions in the various comments sounds 
promising, the problem is that it could definitely break some tasks so it's not 
like we can just switch to using it.
    
   > Btw, how are you testing the code handling freeze/kill failures?
   > In my experience, the simplest way to reliably create a process in an 
uninterruptable sleep (D state) is to create a file on some FUSE-backed FS 
(say, sshfs), mmap that file, stop the program backing the FS with SIGSTOP, and 
repeatedly read/write the mmap-ed bytes until the process eventually hangs. 
That's rather complicated, takes 30 seconds of read/write on my kernel and is 
not really usable in tests... I'm wondering if there is a simpler option.
   
   It depends.
   If the goal is really to create a process in uninterruptible sleep, then yes 
I'm not aware of any way much simpler - typically I'd use a write to an hard 
NFS mount, and use e.g. iptables to drop the packets and cause the process to 
hang.
   
   However I don't believe it's quite necessary to test the freezer code.
   For this change, I just re-ran some cgroups related tests in a loop because 
it was enough to trigger the issue.
   
   For the previous changes I've made to the freezer code - some workarounds 
for kernel bugs - I'd generally just use `strace --inject` to either introduce 
faults or delays, e.g.:
   ```
   cf@thinkpad:~$ strace -ttT --inject=read:delay_enter=1s /bin/cat /etc/fstab 
>/dev/null 
   [...]
   20:34:21.932568 openat(AT_FDCWD, "/etc/fstab", O_RDONLY) = 3 <0.000041>
   [...]
   20:34:21.932936 read(3, "# /etc/fstab: static file system"..., 131072) = 788 
<1.000289>
   20:34:22.933402 write(1, "# /etc/fstab: static file system"..., 788) = 788 
<0.000037>
   20:34:22.933580 read(3, "", 131072)     = 0 <1.000115>
   20:34:23.933857 munmap(0x7ff03b46b000, 139264) = 0 <0.000074>
   [...]
   ```
   
   This can be used to e.g. cause timeouts while reading/writing to 
`freezer.state` file, which is enough to exercise some interesting code paths.
   And I quite like it because it's very generic and can be used for many 
problems.
   
   For something purely freezer-specific, another way I can think of to trigger 
freezer errors, and which can actually probably be done as part of an automated 
test, is to use nested cgroups:
   ```
   root@thinkpad:~# # create parent cgroup
   root@thinkpad:~# mkdir /sys/fs/cgroup/freezer/parent
   root@thinkpad:~# # create child cgroup
   root@thinkpad:~# mkdir /sys/fs/cgroup/freezer/parent/child
   root@thinkpad:~# # freeze parent - since freezing is recursive, the child 
will be frozen
   root@thinkpad:~# echo FROZEN > /sys/fs/cgroup/freezer/parent/freezer.state
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/freezer.state 
   FROZEN
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/child/freezer.state 
   FROZEN
   root@thinkpad:~# # attempt to thaw the child
   root@thinkpad:~# echo THAWED > 
/sys/fs/cgroup/freezer/parent/child/freezer.state 
   root@thinkpad:~# # not working since parent is still frozen
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/child/freezer.state 
   FROZEN
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/freezer.state 
   FROZEN
   root@thinkpad:~# # thaw parent, and child thaws as well
   root@thinkpad:~# echo THAWED > /sys/fs/cgroup/freezer/parent/freezer.state 
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/freezer.state 
   THAWED
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/child/freezer.state 
   THAWED
   root@thinkpad:~# 
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [mesos] cf-natali commented on pull request #388: Fixed a bug where the cgroup task killer leaves the cgroup frozen.

Reply via email to