On Mon, Feb 23, 2026 at 06:27:31AM -1000, Tejun Heo wrote:
> (cc'ing Christian Brauner)
>
> On Sat, Feb 21, 2026 at 06:11:28PM +0200, Amir Goldstein wrote:
> > On Sat, Feb 21, 2026 at 12:32 AM Tejun Heo <[email protected]> wrote:
> > >
> > > Hello, Amir.
> > >
> > > On Fri, Feb 20, 2026 at 10:11:15PM +0200, Amir Goldstein wrote:
> > > > > Yeah, that can be useful. For cgroupfs, there would probably need to
> > > > > be a
> > > > > way to scope it so that it can be used on delegation boundaries too
> > > > > (which
> > > > > we can require to coincide with cgroup NS boundaries).
> > > >
> > > > I have no idea what the above means.
> > > > I could ask Gemini or you and I prefer the latter ;)
> > >
> > > Ah, you chose wrong. :)
> > >
> > > > What are delegation boundaries and NFS boundaries in this context?
> > >
> > > cgroup delegation is giving control of a subtree to someone else:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/tree/Documentation/admin-guide/cgroup-v2.rst#n537
> > >
> > > There's an old way of doing it by changing perms on some files and new way
> > > using cgroup namespace.
> > >
> > > > > Would it be possible to make FAN_MNT_ATTACH work for that?
> > > >
> > > > FAN_MNT_ATTACH is an event generated on a mntns object.
> > > > If "cgroup NS boundaries" is referring to a mntns object and if
> > > > this object is available in the context of cgroup create/destroy
> > > > then it should be possible.
> > >
> > > Great, yes, cgroup namespace way should work then.
> > >
> > > > But FAN_MNT_ATTACH reports a mountid. Is there a mountid
> > > > to report on cgroup create? Probably not?
> > >
> > > Sorry, I thought that was per-mount recursive file event monitoring.
> > > FAN_MARK_MOUNT looks like the right thing if we want to allow monitoring
> > > cgroup creations / destructions in a subtree without recursively watching
> > > each cgroup.
> >
> > The problem sounds very similar to subtree monitoring for mkdir/rmdir on
> > a filesystem, which is a problem that we have not yet solved.
> >
> > The problem with FAN_MARK_MOUNT is that it does not support the
> > events CREATE/DELETE, because those events are currently
>
> Ah, bummer.
>
> > generated in contexts where the mount is not available, and anyway
> > users want to get notified about a created/deleted file/dir in a
> > subtree regardless of the mount through which the create/delete was done.
> >
> > Since commit 58f5fbeb367ff ("fanotify: support watching filesystems
> > and mounts inside userns"), fanotify groups can be associated
> > with a userns.
> >
> > I was thinking that we can have a model where events are delivered
> > to a listener based on whether or not the uid/gid of the object are
> > mappable to the userns of the group.
>
> Given how different NSes can be used independently of each other, it'd
> probably be cleaner if it didn't have to depend on another NS.
>
> > In a filesystem, this criteria cannot guarantee the subtree isolation.
> > I imagine that for delegated cgroups this criteria could match what
> > you need, but I am basing this on pure speculation.
>
> There's a lot of flexibility in the mechanism, so it's difficult to tell.
> e.g. There's nothing preventing somebody from creating two separate subtrees
> delegated to the same user.
Delegation is based on inode ownership, so I'm not sure how well this will
fit into the fanotify model. Maybe the userns group logic that
fanotify added works; I'm not super sure.
> Christian was mentioning allowing separate super for different cgroup mounts
> in another thread. cc'ing him for context.
If cgroupfs changes to tmpfs semantics, where each mount gives you a new
superblock, then it's possible to give each container its own superblock.
That in turn would make it possible to place fanotify watches on the
superblock itself. I think you'd roughly need something like the
following permission model:
* Cgroupfs mounted on the host -> would require global CAP_SYS_ADMIN as
you'd get notified about all tree changes ofc.
* If cgroupfs is mounted in user namespace with a cgroup namespace then
allow the container to monitor the whole superblock.
I think kernfs currently has logic to gate mounting of sysfs in a
container on the network namespace. We would need similar logic to gate
creation of a new superblock for cgroupfs behind the cgroup namespace
(that's the kernfs tagging mechanism iirc).
There's some more annoyance ofc: the current model has one superblock
for the whole system. As such each cgroup is associated with exactly one
inode. So any ownership changes to a given inode are visible _system
wide_. That leads to problems such as an unprivileged user being unable
to delete cgroups that were delegated to an unprivileged container it
owns - at least not without setting up a helper userns and running rm
-rf in it.
Note, if we allow separate cgroup superblocks then this automatically
entails that multiple inodes from different superblocks refer to the
same underlying cgroup - like separate procfs instances have different
inodes that refer to the same task_struct or whatever. This should be
fine locking wise because you serialize on locks associated with the
underlying cgroup - which would be referenced by all inodes.
With this in place, cgroupfs could be mounted inside a container with a
separate inode/dentry tree where each inode->i_{uid,gid} is set
according to the container's user namespace.
That also gets rid of the aforementioned problem where an unprivileged
container user on the host cannot remove cgroups that were delegated to
the container.
It also introduces a change in the delegation model that is worth
considering:
Right now if you delegate ownership it means chown()ing a bunch of files
to the relevant user. With separate superblocks mountable in containers
you could technically delegate write access to multiple containers at
the same time even though they might have completely distinct user
namespaces with isolated idmappings (iow, their global uid/gid ranges
don't overlap and so they can't meaningfully interact with each other).
If container A mounts a new cgroupfs instance and container B mounts a
new cgroupfs instance and someone was crazy enough to let both A and B
share the same cgroup they could both write around in it. It would also
mean that all files in a given cgroup change ownership _within the
superblock that was mounted_ - other superblocks are ofc unaffected:
mkdir /sys/fs/cgroup/lets/go/deeper/
delegate the cgroup:
chown -R 1234 /sys/fs/cgroup/lets/go/deeper
Now the container payload running as 1234 does:
unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWCGROUP);
It sets up the rootfs, chroots into it, and then mounts cgroupfs within
its namespaces:
mount("cgroup", "/sys/fs/cgroup", "cgroup2", 0, NULL);
This would create a new cgroupfs superblock with "deeper" being the
root dentry - similar to how "remounting" changes the visibility. Then
the idmapping associated with the user namespace is taken into account
and all files under "deeper" will be owned by the container's root,
making it all writable for the container (that is different from today,
where you need to chown things around).
TL;DR:
* multi-instance cgroupfs implies multiple inodes for the same cgroup
with custom ownership for each inode
* multi-instance cgroupfs means per-container superblock fanotify
watches
* multi-instance cgroupfs means per-container superblock mount options