On 2026-03-29, Jori Koolstra <[email protected]> wrote: > > Unfortunately there is a pretty big issue with doing it this way (which > > I mentioned in one of the changelogs back then[2]): There are lots of > > VFS operations that imply operations on a file (through a magic-link) > > that are not blocked. truncate(2) and mount(MS_BIND)/open_tree(2) are > > the most problematic examples, but this applies to basically any syscall > > that takes a path argument. If you don't block those then re-opening > > restrictions are functionally useless. > > Ah yes. If I am correct, it would block all the fXXX syscalls from doing > harm (at least w.r.t. read/write operations) because they use the fmode for > rights checking on the fd, and this cannot be changed without going through > an open() variant. > > Hence the issue is the case when we pass a magic path to e.g. truncate() > as it does no upgrade restriction check right now on the struct file. So > we hade to do this for every relevant syscall. And the question is... where.
They would also be bypassed by AT_EMPTY_PATH, so you don't even need magic-links to cause the issue. This is why I considered doing it with "struct path" and doing the blocking in *_permission() -- all of the standard vfs operations already have . > > It also would be really nice if you could block more than just trailing > > component operations -- having a directory file descriptor that blocks > > lookups could be quite handy for a bunch of reasons. > > Yes, so (at the very least) we also want RESTRICT_LOOKUP for directory fds. Well, definitely not initially. You also don't really need to call it RESTRICT_LOOKUP, blocking exec would be the "unixy" thing to do (and it would be nice to have that to block fexecveat(2) too, but that is a _VERY_ deep rabbithole). > > I think the only workable solution to block all of these issue entirely > > and in a comprehensive way is to have something akin to capsicum > > capabilities[3] tied to file descriptors and have all of the VFS > > operations check them (though I think that the way this was attempted in > > the past[4] was far from ideal). > > > > I went through Drysdale's implementation a bit. He links the capability check > to the translation of an fd to a struct file. I agree this is a bit invasive > (as he writes himself), and perhaps we can do better. Is this what you mean > by "far from ideal"? There were quite a few other issues with it, but that is one of them. > > I have tried my hand at a few lighter-weight prototypes over the years > > (mainly trying to add the necessary checks to every generic_permission() > > call, and adding some more generic_permission() calls as well...). My > > last prototype was adding the restriction information to "struct path" > > but that would bloat too many structures to be merge-able. I was > > planning on looking at this again later this year, but if you can come > > up with a nice way of getting a minimal version of capsicum working, > > that'd be amazing. :D > > I would really like to try; it is a very nice problem for me to tackle; > you need to gain experience somehow :) Sorry, I probably should've couched my reply with a bit more caution -- I really appreciate the enthusiasm, but this is quite a hairy topic for quite a number of reasons. It will involve touching everything fd-related in the kernel, it touches on an already-rejected patchset, and is a topic that can get very heated easily. I would not rate the chance of getting something merged very highly, and I think there are more fruitful things for you to tackle and get your hands dirty. Honestly, I think most people will be more than happy with just getting the O_EMPTYPATH part merged -- looking back, I really should've just done that back in 2019. :/ > I wonder how checking all this in generic_permission() would work. The > access to the fd that the procfs magic link provides is essentially an > issue of path traversal, and in generic_permission() you just have the > inode in question. Ah but of course, you can use the mode bits of the > magic link to encode the information, as you suggest. What downside did > you encounter using this idea? It wasn't based on the magic-link mode, I added a capability set to "struct path" and then went and adjusted all generic_permission()s to take the new capability mask. For stuff that did file_permission() or path_permission(), no code change was needed. Unfortunately this ran into quite a few hairy issues that made it basically unmergeable (bloating "struct path" is very bad as it is embedded everywhere, the whole model is kind of questionable because it makes "struct path" have state about the lookup that produced it, some VFS APIs deal with inodes or dentries directly and thus would need more tree-wide fixes, and most importantly it just felt really ugly). The open_tree(2) / mount(MS_BIND) issue is particularly problematic -- ideally you would want to block a RW bind-mount for a read-only file, but there isn't really an obvious mechanism for passing down those kinds of restrictions. Actually, fsconfig() arguments like "upperdir" to overlayfs (or even "source") are likely even worse -- now you need per-filesystem-option handling. I don't think these issues are insurmountable, it's just that the problem is much harder than it looks at first glance and I would humbly suggest that working on reviving a very dead patchset is not the best use of your time. Container runtimes really *really* want this, which is the main reason I keep coming back to it (and we ended up with it in the UAPI group proposal page) but I think it's a very niche thing that has a big risk of being a lot of wasted effort. -- Aleksa Sarai https://www.cyphar.com/
signature.asc
Description: PGP signature

