On 2026-03-26, Jori Koolstra <[email protected]> wrote:
> Add upgrade restrictions to openat2(). Extend struct open_how to allow
> setting transitive restrictions on using file descriptors to open other
> files. A use case for this feature is to block services or containers
> from re-opening/upgrading an O_PATH file descriptor through e.g.
> /proc/<pid>/fd/<nr> as O_WRONLY.
>
> The idea for this features comes form the UAPI group kernel feature idea
> list [1].
>
> [1]
> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#upgrade-masks-in-openat2

I had a version of this in the original openat2(2) pull request many
years ago[1]. Unfortunately there is a pretty big issue with doing it
this way (which I mentioned in one of the changelogs back then[2]):
There are lots of VFS operations that imply operations on a file
(through a magic-link) that are not blocked. truncate(2) and
mount(MS_BIND)/open_tree(2) are the most problematic examples, but this
applies to basically any syscall that takes a path argument. If you
don't block those then re-opening restrictions are functionally useless.

It also would be really nice if you could block more than just trailing
component operations -- having a directory file descriptor that blocks
lookups could be quite handy for a bunch of reasons.

I think the only workable solution to block all of these issues entirely
and in a comprehensive way is to have something akin to capsicum
capabilities[3] tied to file descriptors and have all of the VFS
operations check them (though I think that the way this was attempted in
the past[4] was far from ideal).

I have tried my hand at a few lighter-weight prototypes over the years
(mainly trying to add the necessary checks to every generic_permission()
call, and adding some more generic_permission() calls as well...). My
last prototype was adding the restriction information to "struct path"
but that would bloat too many structures to be merge-able. I was
planning on looking at this again later this year, but if you can come
up with a nice way of getting a minimal version of capsicum working,
that'd be amazing. :D

That being said, while my view at the time of openat2(2) was that we
need to do this at the same time as O_EMPTYPATH (and my tests showed
that this was a backwards-compatible change -- on modern desktops at
least), at this point I think it'd be better to just merge O_EMPTYPATH
by itself and we can work on this hardening separately and make it an
opt-in sysctl (with individual file descriptors being opt-in-able as
well).

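To make the point about path-based syscalls concrete, here is a minimal
userspace sketch of what an unmodified kernel allows today. It keeps
only an O_PATH handle to a scratch file and then modifies the file
twice through /proc/self/fd/<nr>: once by re-opening the magic-link as
O_WRONLY, and once with truncate(2), which never goes through open at
all. The function names and the /tmp scratch path are made up for the
illustration, and it assumes /proc is mounted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Build the magic-link path for one of our own file descriptors. */
void magic_link_path(int fd, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

	assert(n > 0 && (size_t)n < len);
}

/*
 * Hypothetical demo helper. Creates a 5-byte scratch file, keeps only an
 * O_PATH handle to it, then:
 *   1. re-opens the magic-link with O_WRONLY and appends one byte,
 *   2. calls truncate(2) on the magic-link path without opening it.
 * Returns 0 if both modifications succeeded, -1 on any error.
 */
int demo_magic_link_bypass(void)
{
	char tmpl[] = "/tmp/upgrade-demo-XXXXXX";
	char linkpath[64];
	struct stat st;
	int ret = -1;

	int tmpfd = mkstemp(tmpl);
	if (tmpfd < 0)
		return -1;
	ssize_t written = write(tmpfd, "hello", 5);
	close(tmpfd);
	if (written != 5)
		goto out_unlink;

	/* From here on we only hold an O_PATH descriptor. */
	int pathfd = open(tmpl, O_PATH | O_CLOEXEC);
	if (pathfd < 0)
		goto out_unlink;
	magic_link_path(pathfd, linkpath, sizeof(linkpath));

	/* 1. The "upgrade": O_PATH -> O_WRONLY via the magic-link. */
	int wfd = open(linkpath, O_WRONLY | O_APPEND | O_CLOEXEC);
	if (wfd < 0)
		goto out_close;
	written = write(wfd, "!", 1);
	close(wfd);
	if (written != 1 || stat(tmpl, &st) != 0 || st.st_size != 6)
		goto out_close;

	/* 2. No open(2) involved: truncate(2) resolves the magic-link. */
	if (truncate(linkpath, 0) != 0)
		goto out_close;
	if (stat(tmpl, &st) != 0 || st.st_size != 0)
		goto out_close;

	ret = 0;
out_close:
	close(pathfd);
out_unlink:
	unlink(tmpl);
	return ret;
}
```

Calling demo_magic_link_bypass() from a small main() should return 0 on
a current kernel. A restriction that only hooks the re-open in step 1
would leave step 2 untouched.
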
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/
[3]: https://lwn.net/Articles/482858/
[4]: https://lore.kernel.org/lkml/[email protected]/

> +const struct jump_how jump_how_unrestricted = {
> +	.allowed_upgrades = VALID_UPGRADE_FLAGS
> +};
> +
>  /*
>   * Helper to directly jump to a known parsed path from ->get_link,
>   * caller must have taken a reference to path beforehand.
>   */
> -int nd_jump_link(const struct path *path)
> +int nd_jump_link_how(const struct path *path, const struct jump_how *how)
>  {
>  	int error = -ELOOP;
>  	struct nameidata *nd = current->nameidata;
> @@ -1181,6 +1187,7 @@ int nd_jump_link(const struct path *path)
>  	nd->path = *path;
>  	nd->inode = nd->path.dentry->d_inode;
>  	nd->state |= ND_JUMPED;
> +	nd->allowed_upgrades &= how->allowed_upgrades;
>  	return 0;

Way back then, Andy Lutomirski suggested that this be done via the
magic-link modes. While it is kind of ugly and in my patchset this
required adjusting some magic-link modes, it does provide a useful
indication to userspace of two things:

 - What upgrade modes are available for a file (this is useful for
   debugging but is also really necessary for the checkpoint-restore
   folks' needs). I did this with fmode (and exposed fmode in fdinfo)
   but I would not recommend that approach at all.

 - It indicates whether the kernel supports this feature, which will
   allow certain programs to loosen their hardening logic since the
   kernel implements the hardening for them. For instance, most
   container runtimes now either make a copy of /proc/self/exe as a
   sealed memfd or create a read-only overlayfs mount to re-exec
   /proc/self/exe so that containers cannot overwrite the host binary
   by doing a /proc/$pid/exe re-open. See CVE-2019-5736 for more
   details. Indicating that the kernel blocks this attack would let
   container runtimes disable this hardening on such kernels.

Maybe we should at least change the modes even if they aren't used?

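For anyone who has not looked at magic-link modes from userspace: they
are just the permission bits that lstat(2) reports for the link itself.
The sketch below reads them for /proc/self/fd/<nr>. The helper names
are invented for the example, it assumes /proc is mounted, and it only
shows how the bits are read today; it says nothing about what a
restriction-aware kernel would put there.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the permission bits (mode & 0777) of the
 * /proc/self/fd/<nr> magic-link for fd, or -1 on error. lstat(2) is used
 * so that the link itself is examined, not the file it points to.
 */
int magic_link_mode(int fd)
{
	char linkpath[64];
	struct stat st;
	int n = snprintf(linkpath, sizeof(linkpath), "/proc/self/fd/%d", fd);

	assert(n > 0 && (size_t)n < sizeof(linkpath));
	if (lstat(linkpath, &st) != 0)
		return -1;
	return (int)(st.st_mode & 0777);
}

/* Hypothetical helper: open path with flags, report the link mode. */
int mode_for_open(const char *path, int flags)
{
	int fd = open(path, flags | O_CLOEXEC);
	if (fd < 0)
		return -1;
	int mode = magic_link_mode(fd);
	close(fd);
	return mode;
}
```

On current kernels the fd links mirror the access mode the descriptor
was opened with, which is why "ls -l /proc/self/fd" shows lr-x, l-wx
and lrwx entries. A program could use the same lstat(2) call to probe
for adjusted modes on other magic-links such as /proc/self/exe.
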
> -static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
> +static int proc_exe_link(struct dentry *dentry, struct path *exe_path,
> +			 struct jump_how *jump_how)
>  {
>  	struct task_struct *task;
>  	struct file *exe_file;
> @@ -1789,6 +1794,7 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
>  	put_task_struct(task);
>  	if (exe_file) {
>  		*exe_path = exe_file->f_path;
> +		*jump_how = jump_how_unrestricted;
>  		path_get(&exe_file->f_path);
>  		fput(exe_file);
>  		return 0;

This should restrict writes, for the reasons outlined above.

> -static int map_files_get_link(struct dentry *dentry, struct path *path)
> +static int map_files_get_link(struct dentry *dentry, struct path *path,
> +			      struct jump_how *jump_how)
>  {
>  	unsigned long vm_start, vm_end;
>  	struct vm_area_struct *vma;
> @@ -2279,6 +2288,7 @@ static int map_files_get_link(struct dentry *dentry, struct path *path)
>  	rc = -ENOENT;
>  	vma = find_exact_vma(mm, vm_start, vm_end);
>  	if (vma && vma->vm_file) {
> +		*jump_how = jump_how_unrestricted;
>  		*path = *file_user_path(vma->vm_file);
>  		path_get(path);
>  		rc = 0;

This should also restrict writes (this is a similar issue to
/proc/self/exe and is harder to harden against, the primary defense is
just ASLR...).

> diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
> index c34f32e6fa96..fc1147e6ce41 100644
> --- a/include/uapi/linux/openat2.h
> +++ b/include/uapi/linux/openat2.h
> @@ -20,8 +20,14 @@ struct open_how {
>  	__u64 flags;
>  	__u64 mode;
>  	__u64 resolve;
> +	__u64 allowed_upgrades;
>  };
>
> +/* how->allowed_upgrades flags for openat2(2). */
> +#define DENY_UPGRADES 0x01
> +#define READ_UPGRADABLE (0x02 | DENY_UPGRADES)
> +#define WRITE_UPGRADABLE (0x04 | DENY_UPGRADES)

I'm not a huge fan of how this bitmask is set up, to be honest.

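To spell out how I am reading the quoted defines, here is a small
standalone sketch. The three flag values are copied from the hunk
above; the helper functions and the 0x08 "exec" bit are hypothetical
and only encode my interpretation (zero means unrestricted, and once
DENY_UPGRADES is set every upgrade type must be listed to stay
allowed). The checks in the actual patch may well differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the quoted uapi hunk. */
#define DENY_UPGRADES    0x01
#define READ_UPGRADABLE  (0x02 | DENY_UPGRADES)
#define WRITE_UPGRADABLE (0x04 | DENY_UPGRADES)

/* Every *_UPGRADABLE value carries the "restrictions are on" bit. */
static_assert((READ_UPGRADABLE & DENY_UPGRADES) != 0, "read implies deny");
static_assert((WRITE_UPGRADABLE & DENY_UPGRADES) != 0, "write implies deny");

/* Hypothetical: has the caller opted in to restrictions at all? */
bool upgrades_restricted(uint64_t allowed)
{
	return (allowed & DENY_UPGRADES) != 0;
}

/*
 * Hypothetical: is an upgrade identified by type_bit (0x02 read,
 * 0x04 write, ...) permitted under this allowed_upgrades value?
 */
bool may_upgrade(uint64_t allowed, uint64_t type_bit)
{
	if (!upgrades_restricted(allowed))
		return true;	/* zeroed field: old behaviour */
	return (allowed & type_bit) != 0;
}
```

With this reading, may_upgrade(0, 0x04) is true, while
may_upgrade(READ_UPGRADABLE, 0x04) is false. It also means that a
caller passing READ_UPGRADABLE | WRITE_UPGRADABLE today would get
may_upgrade(..., 0x08) == false as soon as a kernel started checking a
new 0x08 bit, even though that caller never asked for such a
restriction.
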
I get that you did it this way to make it disable-by-default, but we
will probably want to add restrictions in the future, and doing so
would break backward compatibility (imagine an execute restriction that
blocks execve(2) -- adding the feature would break all existing
programs if you follow this scheme).

It probably makes more sense to do something more like statx(2) -- you
pick the restrictions you want and get back information about which
restrictions were supported. This is similar to what I did in the old
patchset too, though it didn't give you information like statx(2) does.

-- 
Aleksa Sarai
https://www.cyphar.com/

