On 2026-03-26, Jori Koolstra <[email protected]> wrote:
> Add upgrade restrictions to openat2(). Extend struct open_how to allow
> setting transitive restrictions on using file descriptors to open other
> files. A use case for this feature is to block services or containers
> from re-opening/upgrading an O_PATH file descriptor through e.g.
> /proc/<pid>/fd/<nr> as O_WRONLY.
> 
> The idea for this feature comes from the UAPI group kernel feature idea
> list [1].
> 
> [1] https://github.com/uapi-group/kernel-features?tab=readme-ov-file#upgrade-masks-in-openat2

I had a version of this in the original openat2(2) pull request many
years ago[1].

Unfortunately there is a pretty big issue with doing it this way (which
I mentioned in one of the changelogs back then[2]): There are lots of
VFS operations that imply operations on a file (through a magic-link)
that are not blocked. truncate(2) and mount(MS_BIND)/open_tree(2) are
the most problematic examples, but this applies to basically any syscall
that takes a path argument. If you don't block those then re-opening
restrictions are functionally useless.
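
To make the truncate(2) case concrete, here is a rough userspace sketch
(not part of the patch). It opens a scratch file with O_PATH and then
truncates it through the /proc/self/fd magic-link without ever
re-opening it. The helper name and the scratch path are made up for the
example, and it assumes procfs is mounted at /proc.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration: returns the size of a scratch
 * file after truncate(2) was attempted through the magic-link of an
 * O_PATH descriptor, or -1 if any step failed.
 */
static long truncate_via_magiclink(void)
{
	char tmpl[] = "/tmp/opath-demo-XXXXXX";
	char link[64];
	struct stat st;
	long size = -1;
	int wfd, pfd = -1;

	wfd = mkstemp(tmpl);
	if (wfd < 0)
		return -1;
	if (write(wfd, "hello", 5) != 5)
		goto out;

	/* An O_PATH descriptor cannot be passed to write(2) or ftruncate(2)... */
	pfd = open(tmpl, O_PATH | O_CLOEXEC);
	if (pfd < 0)
		goto out;

	/* ...but truncate(2) on its magic-link operates on the same file. */
	snprintf(link, sizeof(link), "/proc/self/fd/%d", pfd);
	if (truncate(link, 0) < 0)
		goto out;

	if (fstat(wfd, &st) == 0)
		size = (long)st.st_size;
out:
	if (pfd >= 0)
		close(pfd);
	close(wfd);
	unlink(tmpl);
	return size;
}
```

No open(2) of the magic-link happens here, so a restriction that only
hooks the re-open path never sees this operation.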

It also would be really nice if you could block more than just trailing
component operations -- having a directory file descriptor that blocks
lookups could be quite handy for a bunch of reasons.

I think the only workable way to block all of these issues
comprehensively is to have something akin to capsicum capabilities[3]
tied to file descriptors and have all of the VFS operations check them
(though I think that the way this was attempted in the past[4] was far
from ideal).

I have tried my hand at a few lighter-weight prototypes over the years
(mainly trying to add the necessary checks to every generic_permission()
call, and adding some more generic_permission() calls as well...). My
last prototype was adding the restriction information to "struct path"
but that would bloat too many structures to be merge-able. I was
planning on looking at this again later this year, but if you can come
up with a nice way of getting a minimal version of capsicum working,
that'd be amazing. :D

That being said, while my view at the time of openat2(2) was that we
need to do this at the same time as O_EMPTYPATH (and my tests showed
that this was a backwards-compatible change -- on modern desktops at
least), at this point I think it'd be better to just merge O_EMPTYPATH
by itself and we can work on this hardening separately and make it an
opt-in sysctl (with individual file descriptors being opt-in-able as
well).

[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/
[3]: https://lwn.net/Articles/482858/
[4]: https://lore.kernel.org/lkml/[email protected]/

> +const struct jump_how jump_how_unrestricted = {
> +     .allowed_upgrades = VALID_UPGRADE_FLAGS
> +};
> +
>  /*
>   * Helper to directly jump to a known parsed path from ->get_link,
>   * caller must have taken a reference to path beforehand.
>   */
> -int nd_jump_link(const struct path *path)
> +int nd_jump_link_how(const struct path *path, const struct jump_how *how)
>  {
>       int error = -ELOOP;
>       struct nameidata *nd = current->nameidata;
> @@ -1181,6 +1187,7 @@ int nd_jump_link(const struct path *path)
>       nd->path = *path;
>       nd->inode = nd->path.dentry->d_inode;
>       nd->state |= ND_JUMPED;
> +     nd->allowed_upgrades &= how->allowed_upgrades;
>       return 0;

Way back then, Andy Lutomirski suggested that this be done via the
magic-link modes. While it is kind of ugly and in my patchset this
required adjusting some magic-link modes, it does provide a useful
indication to userspace of two things:

 - What upgrade modes are available for a file (this is useful for
   debugging but is also really necessary for the checkpoint-restore
   folks' needs). I did this with fmode (and exposed fmode in fdinfo)
   but I would not recommend that approach at all.

 - It indicates whether the kernel supports this feature, which will
   allow certain programs to loosen their hardening logic since the
   kernel implements the hardening for them.

   For instance, most container runtimes now either make a copy of
   /proc/self/exe as a sealed memfd or create a read-only overlayfs
   mount to re-exec /proc/self/exe so that containers cannot overwrite
   the host binary by doing a /proc/$pid/exe re-open. See CVE-2019-5736
   for more details.

   Indicating that the kernel blocks this attack would let container
   runtimes disable this hardening on such kernels.
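
For reference, the sealed-memfd variant of that userspace hardening
looks roughly like the sketch below. This is a simplified illustration
and not the code of any particular runtime; the helper name is made up,
and it assumes a glibc that provides memfd_create(2).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration: copy /proc/self/exe into a memfd
 * and seal it so the copy can no longer be modified. Returns the memfd,
 * or -1 on error.
 */
static int sealed_self_exe_copy(void)
{
	char buf[65536];
	ssize_t n;
	int src, memfd;

	src = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	if (src < 0)
		return -1;

	memfd = memfd_create("self-exe-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (memfd < 0) {
		close(src);
		return -1;
	}

	/* Plain read/write copy loop, handling short writes. */
	while ((n = read(src, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (n > 0) {
			ssize_t w = write(memfd, p, n);

			if (w < 0)
				goto fail;
			p += w;
			n -= w;
		}
	}
	if (n < 0)
		goto fail;

	/* After this, writes and size changes are refused by the kernel. */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SEAL | F_SEAL_SHRINK |
				      F_SEAL_GROW | F_SEAL_WRITE) < 0)
		goto fail;

	close(src);
	return memfd;
fail:
	close(src);
	close(memfd);
	return -1;
}
```

A runtime would then re-exec from the returned descriptor (for example
with fexecve(3)), so that a /proc/$pid/exe re-open from inside the
container only ever reaches the sealed copy.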

Maybe we should at least change the modes even if they aren't used?

> -static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
> +static int proc_exe_link(struct dentry *dentry, struct path *exe_path,
> +                      struct jump_how *jump_how)
>  {
>       struct task_struct *task;
>       struct file *exe_file;
> @@ -1789,6 +1794,7 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
>       put_task_struct(task);
>       if (exe_file) {
>               *exe_path = exe_file->f_path;
> +             *jump_how = jump_how_unrestricted;
>               path_get(&exe_file->f_path);
>               fput(exe_file);
>               return 0;

This should restrict writes, for the reasons outlined above.

> -static int map_files_get_link(struct dentry *dentry, struct path *path)
> +static int map_files_get_link(struct dentry *dentry, struct path *path,
> +                           struct jump_how *jump_how)
>  {
>       unsigned long vm_start, vm_end;
>       struct vm_area_struct *vma;
> @@ -2279,6 +2288,7 @@ static int map_files_get_link(struct dentry *dentry, struct path *path)
>       rc = -ENOENT;
>       vma = find_exact_vma(mm, vm_start, vm_end);
>       if (vma && vma->vm_file) {
> +             *jump_how = jump_how_unrestricted;
>               *path = *file_user_path(vma->vm_file);
>               path_get(path);
>               rc = 0;

This should also restrict writes (this is a similar issue to
/proc/self/exe but is harder to harden against; the primary defense is
just ASLR...).

> diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
> index c34f32e6fa96..fc1147e6ce41 100644
> --- a/include/uapi/linux/openat2.h
> +++ b/include/uapi/linux/openat2.h
> @@ -20,8 +20,14 @@ struct open_how {
>       __u64 flags;
>       __u64 mode;
>       __u64 resolve;
> +     __u64 allowed_upgrades;
>  };
>  
> +/* how->allowed_upgrades flags for openat2(2). */
> +#define DENY_UPGRADES                0x01
> +#define READ_UPGRADABLE              (0x02 | DENY_UPGRADES)
> +#define WRITE_UPGRADABLE     (0x04 | DENY_UPGRADES)

I'm not a huge fan of how this bitmask is set up, to be honest. I get
that you did it this way to make it disable-by-default, but we will
probably want to add restrictions in the future, and under this scheme
any new restriction would break backward compatibility (imagine an
execute restriction that blocks execve(2) -- adding the feature would
break all existing programs that already set the mask).
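
A small userspace model of the problem, as I read the proposed
encoding: once DENY_UPGRADES is set, only the upgrades whose bits are
present are allowed. The three macros are copied from the patch; the
EXEC bit and the checking function are hypothetical and only exist to
show what happens when a new bit is added later.

```c
#include <stdbool.h>
#include <stdint.h>

/* Copied from the proposed include/uapi/linux/openat2.h hunk. */
#define DENY_UPGRADES		0x01
#define READ_UPGRADABLE		(0x02 | DENY_UPGRADES)
#define WRITE_UPGRADABLE	(0x04 | DENY_UPGRADES)

/* HYPOTHETICAL future bit, not in the patch. */
#define EXEC_UPGRADABLE		(0x08 | DENY_UPGRADES)

/*
 * Assumed semantics (a model, not the patch's code): with DENY_UPGRADES
 * clear everything is allowed; with it set, an upgrade is allowed only
 * if its own bit is present in the mask.
 */
static bool upgrade_allowed(uint64_t mask, uint64_t upgrade)
{
	if (!(mask & DENY_UPGRADES))
		return true;
	return (mask & upgrade & ~(uint64_t)DENY_UPGRADES) != 0;
}
```

A program written today that passes READ_UPGRADABLE | WRITE_UPGRADABLE
to mean "allow everything" would, in this model, start denying the exec
upgrade as soon as a kernel learned about the new bit, without the
program having changed at all.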

It probably makes more sense to do something more like statx(2) -- you
pick the restrictions you want and get back information about which
restrictions were supported. This is similar to what I did in the old
patchset too, though it didn't give you information like statx(2) does.
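
A rough model of what I mean by a statx(2)-like interface is below.
Every name in it is hypothetical (there is no such struct or flag in
the kernel); it only shows the request/granted split, with the
"kernel" side reduced to a mask of supported restrictions.

```c
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL restriction bits, for illustration only. */
#define RESTRICT_READ	UINT64_C(0x1)
#define RESTRICT_WRITE	UINT64_C(0x2)
#define RESTRICT_EXEC	UINT64_C(0x4)	/* imagine this arrives in a later kernel */

/* HYPOTHETICAL request structure, loosely modelled on statx's mask fields. */
struct restrict_request {
	uint64_t requested;	/* in:  restrictions the caller wants */
	uint64_t granted;	/* out: restrictions actually applied */
};

/* Stand-in for the kernel side: apply only what this "kernel" knows about. */
static void model_apply(uint64_t kernel_supported, struct restrict_request *req)
{
	req->granted = req->requested & kernel_supported;
}

/* The caller decides whether a partial grant is acceptable. */
static bool fully_enforced(const struct restrict_request *req)
{
	return req->granted == req->requested;
}
```

Because restrictions are opt-in per bit, an old program never requests
RESTRICT_EXEC and so is unaffected when a kernel adds it, while a new
program on an old kernel can see from the granted mask that it still
needs its own userspace hardening.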

-- 
Aleksa Sarai
https://www.cyphar.com/
