On Tue, Feb 24, 2026 at 12:03 PM Christian Brauner <[email protected]> wrote:
>
> On Mon, Feb 23, 2026 at 06:27:31AM -1000, Tejun Heo wrote:
> > (cc'ing Christian Brauner)
> >
> > On Sat, Feb 21, 2026 at 06:11:28PM +0200, Amir Goldstein wrote:
> > > On Sat, Feb 21, 2026 at 12:32 AM Tejun Heo <[email protected]> wrote:
> > > >
> > > > Hello, Amir.
> > > >
> > > > On Fri, Feb 20, 2026 at 10:11:15PM +0200, Amir Goldstein wrote:
> > > > > > > Yeah, that can be useful. For cgroupfs, there would probably
> > > > > > > need to be a way to scope it so that it can be used on
> > > > > > > delegation boundaries too (which we can require to coincide
> > > > > > > with cgroup NS boundaries).
> > > > >
> > > > > I have no idea what the above means.
> > > > > I could ask Gemini or you and I prefer the latter ;)
> > > >
> > > > Ah, you chose wrong. :)
> > > >
> > > > > > What are delegation boundaries and NS boundaries in this context?
> > > >
> > > > cgroup delegation is giving control of a subtree to someone else:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/tree/Documentation/admin-guide/cgroup-v2.rst#n537
> > > >
> > > > > There's an old way of doing it by changing perms on some files and
> > > > > new way using cgroup namespace.
> > > >
> > > > > > Would it be possible to make FAN_MNT_ATTACH work for that?
> > > > >
> > > > > FAN_MNT_ATTACH is an event generated on a mntns object.
> > > > > If "cgroup NS boundaries" is referring to a mntns object and if
> > > > > this object is available in the context of cgroup create/destroy
> > > > > then it should be possible.
> > > >
> > > > Great, yes, cgroup namespace way should work then.
> > > >
> > > > > But FAN_MNT_ATTACH reports a mountid. Is there a mountid
> > > > > to report on cgroup create? Probably not?
> > > >
> > > > Sorry, I thought that was per-mount recursive file event monitoring.
> > > > FAN_MARK_MOUNT looks like the right thing if we want to allow monitoring
> > > > > cgroup creations / destructions in a subtree without recursively
> > > > > watching each cgroup.
> > >
> > > The problem sounds very similar to subtree monitoring for mkdir/rmdir on
> > > a filesystem, which is a problem that we have not yet solved.
> > >
> > > The problem with FAN_MARK_MOUNT is that it does not support the
> > > events CREATE/DELETE, because those events are currently
> >
> > Ah, bummer.
> >
> > > > monitored in a context where the mount is not available, and anyway
> > > > what users want is to get notified on a deleted file/dir in a subtree
> > > > regardless of the mount through which the create/delete was done.
> > >
> > > > Since commit 58f5fbeb367ff ("fanotify: support watching filesystems
> > > > and mounts inside userns"), fanotify groups can be associated
> > > > with a userns.
> > >
> > > I was thinking that we can have a model where events are delivered
> > > to a listener based on whether or not the uid/gid of the object are
> > > mappable to the userns of the group.
> >
> > Given how different NSes can be used independently of each other, it'd
> > probably be cleaner if it doesn't have to depend on another NS.
> >
> > > > In a filesystem, this criterion cannot guarantee subtree isolation.
> > > > I imagine that for delegated cgroups this criterion could match what
> > > > you need, but I am basing this on pure speculation.
> >
> > There's a lot of flexibility in the mechanism, so it's difficult to tell.
> > e.g. There's nothing preventing somebody from creating two separate subtrees
> > delegated to the same user.
>
> > Delegation is based on inode ownership, and I'm not sure how well this
> > will fit into the fanotify model. Maybe the group logic for userns that
> > fanotify added works. I'm not super sure.
>
> > Christian was mentioning allowing separate super for different cgroup mounts
> > in another thread. cc'ing him for context.
>
> If cgroupfs changes to tmpfs semantics where each mount gives you a new
> superblock then it's possible to give each container its own superblock.
> That in turn would make it possible to place fanotify watches on the
> superblock itself. I think you'd roughly need something like the
> following permission model:
>

It's hard for me to estimate the effort of changing to multi sb model,
but judging by the length of the email I trimmed below, it does not
sound trivial...

How do you guys feel about something like this patch which associates
an owner userns to every cgroup?

I have this POC branch from a long time ago [1] to filter all events
on sb by in_userns() criteria.  The semantics for real filesystems
were a bit difficult, but perhaps this model can work well for these
pseudo singleton fs.

I am trying to work on a model that could be useful for both cgroupfs
and nsfs:

If user is capable in userns, user will be able to set an sb
watch for all events (say DELETE_SELF) on the sb, for objects
whose owner_userns is in_userns() of the fanotify listener.
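To make the filter criterion concrete, here is a minimal userspace model
of the check. It is only a sketch: struct user_ns_model, struct
object_model, model_in_userns() and model_should_deliver() are
hypothetical names invented for illustration, and the loop merely mimics
what I understand the kernel's in_userns() helper to do (walk up from the
object's namespace until reaching the listener's level).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct user_namespace (assumption). */
struct user_ns_model {
	const struct user_ns_model *parent;	/* NULL for the initial namespace */
	int level;				/* 0 for the initial namespace */
};

/* Hypothetical object that records its owner, e.g. a cgroup. */
struct object_model {
	const struct user_ns_model *owner_userns;
};

/* True if @child is @ancestor itself or one of its descendants. */
bool model_in_userns(const struct user_ns_model *ancestor,
		     const struct user_ns_model *child)
{
	const struct user_ns_model *ns;

	assert(ancestor && child);

	/* Walk up from the child until we reach the ancestor's depth. */
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/*
 * Proposed filter: deliver an event to a listener only if the object's
 * owner_userns is the listener's userns or nested below it.
 */
bool model_should_deliver(const struct user_ns_model *listener_ns,
			  const struct object_model *obj)
{
	return model_in_userns(listener_ns, obj->owner_userns);
}
```

With this model, a listener in a container's userns would see objects
owned by that userns or by namespaces nested inside it, but not objects
owned by a sibling container or by the parent.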

This will enable watching for torn down cgroups and namespaces
which are visible to said user via delegated cgroups mount
or via listns().
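From the listener's side, I imagine the setup would look roughly like
the sketch below. The fanotify_init()/fanotify_mark() calls and flags are
the existing API; what is assumed here (and does not exist today) is that
a user who is only capable in a non-initial userns would be allowed to
set this mark on cgroupfs and would then receive only events for objects
owned by its userns. The helper name watch_sb_delete_self() is made up
for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/fanotify.h>
#include <unistd.h>

/*
 * Set a filesystem (sb) mark for FAN_DELETE_SELF on the filesystem that
 * contains @mnt_path, reporting objects by file handle (fid).
 * Returns the fanotify fd on success, or -1 with errno set.
 */
int watch_sb_delete_self(const char *mnt_path)
{
	int fd, saved_errno;

	/* FAN_REPORT_FID: events carry a file handle, not an open fd. */
	fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);
	if (fd < 0)
		return -1;

	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_DELETE_SELF, AT_FDCWD, mnt_path) < 0) {
		saved_errno = errno;	/* close() may clobber errno */
		close(fd);
		errno = saved_errno;
		return -1;
	}
	return fd;
}
```

A caller would pass the delegated cgroup mount (e.g. "/sys/fs/cgroup")
and read events from the returned fd. Whether the mark succeeds on a
given kernel depends on privileges and on fid support in the target
filesystem.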

I would like to allow calling fsnotify_obj_remove() hook with
encoded object fid (e.g. nsfs_file_handle) instead of the vfs inode,
so that cgroupfs/nsfs could report dying objects without needing
to associate a vfs inode with them.

WDYT? Is this an interesting direction to pursue?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxgt1Cx5jx3L6iaDvbzCWPv=fcmglaa9odkiu9h718m...@mail.gmail.com/
From 4b3a56b8ca548354214329729997a78c72a016d3 Mon Sep 17 00:00:00 2001
From: Amir Goldstein <[email protected]>
Date: Tue, 3 Mar 2026 14:04:22 +0100
Subject: [PATCH] cgroup: track owner_userns per cgroup

Add owner_userns field to struct cgroup to record which user namespace
owns a given cgroup.

For hierarchy roots, the owner is always init_user_ns.
For cgroups created via mkdir (cgroup_create()), possibly inside a
delegated cgroup namespace, the owner is the user namespace of the
creating task's cgroup namespace.

This field is a prerequisite for delivering userns-scoped fsnotify
events (e.g. FAN_DELETE_SELF via FAN_MARK_FILESYSTEM) when a cgroup is
destroyed, allowing a sufficiently privileged admin inside a delegated
cgroup namespace to watch for cgroup teardown without requiring access
to the full system view.

Signed-off-by: Amir Goldstein <[email protected]>
---
 include/linux/cgroup-defs.h | 8 ++++++++
 kernel/cgroup/cgroup.c      | 6 ++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index bb92f5c169ca2..4ee344792a1d5 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -33,6 +33,7 @@ struct kernfs_ops;
 struct kernfs_open_file;
 struct seq_file;
 struct poll_table_struct;
+struct user_namespace;
 
 #define MAX_CGROUP_TYPE_NAMELEN 32
 #define MAX_CGROUP_ROOT_NAMELEN 64
@@ -551,6 +552,13 @@ struct cgroup {
 
 	struct cgroup_root *root;
 
+	/*
+	 * The user namespace that owns this cgroup: the creating task's
+	 * cgroup_ns->user_ns for child cgroups, or init_user_ns for
+	 * hierarchy roots.  Determines the scope of filesystem watches.
+	 */
+	struct user_namespace *owner_userns;
+
 	/*
 	 * List of cgrp_cset_links pointing at css_sets with tasks in this
 	 * cgroup.  Protected by css_set_lock.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c22cda7766d84..e0beaf5cc8c49 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1381,6 +1381,7 @@ static void cgroup_exit_root_id(struct cgroup_root *root)
 
 void cgroup_free_root(struct cgroup_root *root)
 {
+	put_user_ns(root->cgrp.owner_userns);
 	kfree_rcu(root, rcu);
 }
 
@@ -2195,6 +2196,7 @@ int cgroup_setup_root(struct cgroup_root *root, u32 ss_mask)
 	root_cgrp->kn = kernfs_root_to_node(root->kf_root);
 	WARN_ON_ONCE(cgroup_ino(root_cgrp) != 1);
 	root_cgrp->ancestors[0] = root_cgrp;
+	root_cgrp->owner_userns = get_user_ns(&init_user_ns);
 
 	ret = css_populate_dir(&root_cgrp->self);
 	if (ret)
@@ -5607,6 +5609,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
 			psi_cgroup_free(cgrp);
+			put_user_ns(cgrp->owner_userns);
 			kfree(cgrp);
 		} else {
 			/*
@@ -5848,6 +5851,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
+	cgrp->owner_userns = get_user_ns(current->nsproxy->cgroup_ns->user_ns);
+
 	ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
 	if (ret)
 		goto out_free_cgrp;
@@ -5956,6 +5961,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
+	put_user_ns(cgrp->owner_userns);
 	kfree(cgrp);
 	return ERR_PTR(ret);
 }
-- 
2.53.0
