On Sat, 2025-11-29 at 14:41 +0530, Bhavik Sachdev wrote:
> We would like to add support for checkpoint/restoring file descriptors
> open on these "unmounted" mounts to CRIU (Checkpoint/Restore in
> Userspace) [1].
>
> Currently, we have no way to get mount info for these "unmounted" mounts
> since they do appear in /proc/<pid>/mountinfo and statmount does not
> work on them, since they do not belong to any mount namespace.
>
> This patch helps us by providing a way to get mountinfo for these
> "unmounted" mounts by using a fd on the mount.
>
> Changes from v6 [2] to v7:
> * Add kselftests for STATMOUNT_BY_FD flag.
>
> * Instead of renaming mnt_id_req.mnt_ns_fd to mnt_id_req.fd introduce a
> union so struct mnt_id_req looks like this:
>
> struct mnt_id_req {
> __u32 size;
> union {
> __u32 mnt_ns_fd;
> __u32 mnt_fd;
> };
> __u64 mnt_id;
> __u64 param;
> __u64 mnt_ns_id;
> };
>
> * In case of STATMOUNT_BY_FD grab mnt_ns inside of do_statmount(),
> since we get mnt_ns from mnt, which should happen under namespace lock.
>
> * Remove the modifications made to grab_requested_mnt_ns, those were
> never needed.
>
> Changes from v5 [3] to v6:
> * Instead of returning "[unmounted]" as the mount point for "unmounted"
> mounts, we unset the STATMOUNT_MNT_POINT flag in statmount.mask.
>
> * Instead of returning 0 as the mnt_ns_id for "unmounted" mounts, we
> unset the STATMOUNT_MNT_NS_ID flag in statmount.mask.
>
> * Added comment in `do_statmount` clarifying that the caller sets s->mnt
> in case of STATMOUNT_BY_FD.
>
> * In `do_statmount` move the mnt_ns_id and mnt_ns_empty() check just
> before lookup_mnt_in_ns().
>
> * We took another look at the capability checks for getting information
> for "unmounted" mounts using an fd and decided to remove them for the
> following reasons:
>
> - All fs related information is available via fstatfs() without any
> capability check.
>
> - Mount information is also available via /proc/pid/mountinfo (without
> any capability check).
>
> - Given that we have access to a fd on the mount which tells us that
> we had access to the mount at some point (or someone that had access
> gave us the fd). So, we should be able to access mount info.
>
> Changes from v4 [4] to v5:
> Check only for s->root.mnt to be NULL instead of checking for both
> s->root.mnt and s->root.dentry (I did not find a case where only one of
> them would be NULL).
>
> * Only allow system root (CAP_SYS_ADMIN in init_user_ns) to call
> statmount() on fd's on "unmounted" mounts. We (mostly Pavel) spent some
> time thinking about how our previous approach (of checking the opener's
> file credentials) caused problems.
>
> Please take a look at the linked pictures they describe everything more
> clearly.
>
> Case 1: A fd is on a normal mount (Link to Picture: [5])
> Consider, a situation where we have two processes P1 and P2 and a file
> F1. F1 is opened on mount ns M1 by P1. P1 is nested inside user
> namespace U1 and U2. P2 is also in U1. P2 is also in a pid namespace and
> mount namespace separate from M1.
>
> P1 sends F1 to P2 (using a unix socket). But, P2 is unable to call
> statmount() on F1 because since it is a separate pid and mount
> namespace. This is good and expected.
>
> Case 2: A fd is on a "unmounted" mount (Link to Picture: [6])
> Consider a similar situation as Case 1. But now F1 is on a mounted that
> has been "unmounted". Now, since we used openers credentials to check
> for permissions P2 ends up having the ability call statmount() and get
> mount info for this "unmounted" mount.
>
> Hence, It is better to restrict the ability to call statmount() on fds
> on "unmounted" mounts to system root only (There could also be other
> cases than the one described above).
>
> Changes from v3 [7] to v4:
> * Change the string returned when there is no mountpoint to be
> "[unmounted]" instead of "[detached]".
> * Remove the new DEFINE_FREE put_file and use the one already present in
> include/linux/file.h (fput) [8].
> * Inside listmount consistently pass 0 in flags to copy_mnt_id_req and
> prepare_klistmount()->grab_requested_mnt_ns() and remove flags from the
> prepare_klistmount prototype.
> * If STATMOUNT_BY_FD is set, check for mnt_ns_id == 0 && mnt_id == 0.
>
> Changes from v2 [9] to v3:
> * Rename STATMOUNT_FD flag to STATMOUNT_BY_FD.
> * Fixed UAF bug caused by the reference to fd_mount being bound by scope
> of CLASS(fd_raw, f)(kreq.fd) by using fget_raw instead.
> * Reused @spare parameter in mnt_id_req instead of adding new fields to
> the struct.
>
> Changes from v1 [10] to v2:
> v1 of this patchset, took a different approach and introduced a new
> umount_mnt_ns, to which "unmounted" mounts would be moved to (instead of
> their namespace being NULL) thus allowing them to be still available via
> statmount.
>
> Introducing umount_mnt_ns complicated namespace locking and modified
> performance sensitive code [11] and it was agreed upon that fd-based
> statmount would be better.
>
> This code is also available on github [12].
>
> [1]: https://github.com/checkpoint-restore/criu/pull/2754
> [2]:
> https://lore.kernel.org/all/[email protected]/
> [3]:
> https://lore.kernel.org/criu/[email protected]/T/#u
> [4]:
> https://lore.kernel.org/all/[email protected]/
> [5]:
> https://github.com/bsach64/linux/blob/statmount-fd-v5/fd_on_normal_mount.png
> [6]:
> https://github.com/bsach64/linux/blob/statmount-fd-v5/file_on_unmounted_mount.png
> [7]:
> https://lore.kernel.org/all/[email protected]/
> [8]:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/file.h#n97
> [9]:
> https://lore.kernel.org/linux-fsdevel/[email protected]/
> [10]:
> https://lore.kernel.org/linux-fsdevel/[email protected]/
> [11]:
> https://lore.kernel.org/linux-fsdevel/[email protected]/
> [12]: https://github.com/bsach64/linux/tree/statmount-fd-v7
>
> Bhavik Sachdev (3):
> statmount: permission check should return EPERM
> statmount: accept fd as a parameter
> selftests: statmount: tests for STATMOUNT_BY_FD
>
> fs/namespace.c | 102 ++++---
> include/uapi/linux/mount.h | 10 +-
> .../filesystems/statmount/statmount.h | 15 +-
> .../filesystems/statmount/statmount_test.c | 261 +++++++++++++++++-
> .../filesystems/statmount/statmount_test_ns.c | 101 ++++++-
> 5 files changed, 430 insertions(+), 59 deletions(-)
This looks useful.
Reviewed-by: Jeff Layton <[email protected]>