From: Andrei Vagin <[email protected]>

Currently a shared group can only be inherited from a source mount.
This patch adds the ability to add a mount to an existing shared group.

mount(source, target, NULL, MS_SET_GROUP, NULL)

mount() with the MS_SET_GROUP flag adds the "target" mount to the group
of the "source" mount. The calling process has to have the CAP_SYS_ADMIN
capability in the namespaces of these mounts. The source and the target
mounts have to have the same super block.

This new functionality together with "mnt: Tuck mounts under others
instead of creating shadow/side mounts." allows CRIU to dump and restore
any set of mount namespaces.

Currently we have a lot of issues with dumping and restoring mount
namespaces. The biggest problem is that we can't construct mount trees
directly, for several reasons:

* groups can't be set, they can only be inherited
* file systems have to be mounted from the specified user namespaces
* the mount() syscall doesn't just create one mount: the mount is also
  propagated to all members of a parent group
* umount() doesn't detach mounts from all members of a group (mounts
  with children are not umounted)
* mounts are propagated underneath existing mounts
* mount() doesn't allow bind-mounts between two namespaces
* processes can have open file descriptors to overmounted files

All these operations are non-trivial, making the task of restoring a
mount namespace practically unsolvable in reasonable time. The proposed
change allows restoring a mount namespace in a direct manner, without
any super complex logic.

Cc: Eric W. Biederman <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>

The patch has been sitting on lkml for a long time without much review:
https://patchwork.kernel.org/patch/9703885/

But with it we can implement correct mounts restore in vzcriu much more
easily.

Add some restrictions:
a) prohibit setting a group on a non-mnt_root dentry;
b) prohibit the destination mount from being in a non-current mntns;
c) only a super or pseudosuper ve can set a group.

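For illustration, a minimal userspace sketch of the call described above.
It assumes a kernel carrying this patch; MS_SET_GROUP is not in stock
headers, so its value is copied from the fs.h hunk of this patch, and
set_mount_group() is a hypothetical helper name, not an existing API.
Per do_set_group(), both paths have to be mount roots on the same super
block, the target has to be in the caller's mount namespace, and the
target must be neither shared nor slave yet.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Assumption: value copied from the include/uapi/linux/fs.h hunk of
 * this patch. Stock kernel and libc headers do not define it.
 */
#ifndef MS_SET_GROUP
#define MS_SET_GROUP (1UL << 26)
#endif

static_assert(MS_SET_GROUP == (1UL << 26),
	      "MS_SET_GROUP must match the value used by this patch");

/*
 * Hypothetical helper: add the mount at "target" to the sharing group
 * of the mount at "source". Returns 0 on success or a negative errno.
 */
int set_mount_group(const char *source, const char *target)
{
	/* Mirrors the empty-name check in do_set_group() (-EINVAL). */
	if (!source || !*source)
		return -EINVAL;

	/* The helper's own guard; not taken from the patch. */
	if (!target || !*target)
		return -EINVAL;

	/* fstype and data are unused by this command, so pass NULL. */
	if (mount(source, target, NULL, MS_SET_GROUP, NULL) < 0)
		return -errno;

	return 0;
}
```

A restorer could create a private bind mount first and then call
set_mount_group("/path/of/group/member", "/path/of/new/mount") on it;
both paths here are placeholders.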
https://jira.sw.ru/browse/PSBM-58617
Signed-off-by: Pavel Tikhomirov <[email protected]>
---
 fs/namespace.c          | 65 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  6 ++++
 2 files changed, 71 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index b06fdd118629..2bc53000c026 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2324,6 +2324,69 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }
 
+static int do_set_group(struct path *path, const char *sibling_name)
+{
+	struct ve_struct *ve = get_exec_env();
+	struct mount *sibling, *mnt;
+	struct path sibling_path;
+	int err;
+
+	if (!ve_is_super(ve) && !ve->is_pseudosuper)
+		return -EPERM;
+
+	if (!sibling_name || !*sibling_name)
+		return -EINVAL;
+
+	if (path->dentry != path->mnt->mnt_root)
+		return -EINVAL;
+
+	err = kern_path(sibling_name, LOOKUP_FOLLOW, &sibling_path);
+	if (err)
+		return err;
+
+	err = -EINVAL;
+	if (sibling_path.dentry != sibling_path.mnt->mnt_root)
+		goto out_put;
+
+	sibling = real_mount(sibling_path.mnt);
+	mnt = real_mount(path->mnt);
+
+	if (!check_mnt(mnt))
+		goto out_put;
+
+	namespace_lock();
+
+	err = -EPERM;
+	if (!sibling->mnt_ns ||
+	    !ns_capable(sibling->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out_unlock;
+
+	err = -EINVAL;
+	if (sibling->mnt.mnt_sb != mnt->mnt.mnt_sb)
+		goto out_unlock;
+
+	if (IS_MNT_SHARED(mnt) || IS_MNT_SLAVE(mnt))
+		goto out_unlock;
+
+	if (IS_MNT_SLAVE(sibling)) {
+		list_add(&mnt->mnt_slave, &sibling->mnt_slave);
+		mnt->mnt_master = sibling->mnt_master;
+	}
+
+	if (IS_MNT_SHARED(sibling)) {
+		mnt->mnt_group_id = sibling->mnt_group_id;
+		list_add(&mnt->mnt_share, &sibling->mnt_share);
+		set_mnt_shared(mnt);
+	}
+
+	err = 0;
+out_unlock:
+	namespace_unlock();
+out_put:
+	path_put(&sibling_path);
+	return err;
+}
+
 static int do_move_mount(struct path *path, const char *old_name)
 {
 	struct path old_path, parent_path;
@@ -2810,6 +2873,8 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 		retval = do_change_type(&path, flags);
 	else if (flags & MS_MOVE)
 		retval = do_move_mount(&path, dev_name);
+	else if (flags & MS_SET_GROUP)
+		retval = do_set_group(&path, dev_name);
 	else
 		retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
 				      dev_name, data_page);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 876c308d57c0..699ad890ac76 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -132,6 +132,12 @@ struct inodes_stat_t {
 #define MS_STRICTATIME (1<<24) /* Always perform atime updates */
 #define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
 
+/*
+ * Here are commands and flags. Commands are handled in do_mount()
+ * and can intersect with kernel internal flags.
+ */
+#define MS_SET_GROUP (1<<26) /* Add a mount into a shared group */
+
 /* These sb flags are internal to the kernel */
 #define MS_SUBMOUNT (1<<26)
 #define MS_NOREMOTELOCK (1<<27)
-- 
2.24.1

_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel
