On Wed, Feb 18, 2026 at 12:32 AM T.J. Mercier <[email protected]> wrote:
>
> On Tue, Feb 17, 2026 at 1:25 PM Amir Goldstein <[email protected]> wrote:
> >
> > On Tue, Feb 17, 2026 at 9:26 PM T.J. Mercier <[email protected]> wrote:
> > >
> > > On Tue, Feb 17, 2026 at 2:19 AM Amir Goldstein <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 12, 2026 at 01:58:13PM -0800, T.J. Mercier wrote:
> > > > > Currently some kernfs files (e.g. cgroup.events, memory.events)
> > > > > support inotify watches for IN_MODIFY, but unlike with regular
> > > > > filesystems, they do not receive IN_DELETE_SELF or IN_IGNORED
> > > > > events when they are removed.
> > > > >
> > > > > This creates a problem for processes monitoring cgroups. For
> > > > > example, a service monitoring memory.events for memory.high
> > > > > breaches needs to know when a cgroup is removed to clean up its
> > > > > state. Where it's known that a cgroup is removed when all
> > > > > processes die, without IN_DELETE_SELF the service must resort to
> > > > > inefficient workarounds such as:
> > > > > 1. Periodically scanning procfs to detect process death (wastes
> > > > >    CPU and is susceptible to PID reuse).
> > > > > 2. Placing an additional IN_DELETE watch on the parent directory
> > > > >    (wastes resources managing double the watches).
> > > > > 3. Holding a pidfd for every monitored cgroup (can exhaust file
> > > > >    descriptors).
> > > > >
> > > > > This patch enables kernfs to send IN_DELETE_SELF and IN_IGNORED
> > > > > events. This allows applications to rely on a single existing
> > > > > watch on the file of interest (e.g. memory.events) to receive
> > > > > notifications for both modifications and the eventual removal of
> > > > > the file, as well as automatic watch descriptor cleanup,
> > > > > simplifying userspace logic and improving resource efficiency.
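For illustration only, the single-watch design described in the patch might look roughly like the userspace sketch below. It is not part of the patch: `struct cg_monitor`, `cg_monitor_watch()` and `cg_monitor_handle()` are hypothetical names, and the sketch assumes a kernel with this change applied. On kernels without it, IN_DELETE_SELF and IN_IGNORED are never delivered for kernfs files. One watch on memory.events covers modifications (IN_MODIFY), removal (IN_DELETE_SELF) and the automatic watch cleanup (IN_IGNORED, which inotify reports without it being requested).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>

/* Hypothetical per-cgroup state for this sketch (not a kernel or libc type). */
struct cg_monitor {
	int wd;            /* watch descriptor for e.g. memory.events */
	unsigned modified; /* IN_MODIFY events seen: re-read the file */
	bool removed;      /* IN_DELETE_SELF seen: the cgroup is gone */
	bool released;     /* IN_IGNORED seen: the kernel dropped the watch */
};

/* Set up the single watch; returns 0 on success, -1 on error (errno set). */
int cg_monitor_watch(int ifd, struct cg_monitor *m, const char *path)
{
	memset(m, 0, sizeof(*m));
	/* IN_IGNORED is not requested: inotify always reports it. */
	m->wd = inotify_add_watch(ifd, path, IN_MODIFY | IN_DELETE_SELF);
	return m->wd < 0 ? -1 : 0;
}

/*
 * Consume a buffer returned by read() on the inotify fd and update the
 * monitor; events for other watch descriptors are skipped.
 */
void cg_monitor_handle(struct cg_monitor *m, const char *buf, size_t len)
{
	size_t off = 0;

	while (off + sizeof(struct inotify_event) <= len) {
		struct inotify_event ev;

		/* Copy the fixed-size header so buf need not be aligned. */
		memcpy(&ev, buf + off, sizeof(ev));
		if (ev.wd == m->wd) {
			if (ev.mask & IN_MODIFY)
				m->modified++;
			if (ev.mask & IN_DELETE_SELF)
				m->removed = true;
			if (ev.mask & IN_IGNORED)
				m->released = true;
		}
		/* Skip the header plus the (possibly empty) name field. */
		off += sizeof(struct inotify_event) + ev.len;
	}
}
```

A monitor would pass each buffer filled by `read()` on the inotify descriptor to `cg_monitor_handle()`. Once `released` is set, the watch descriptor is no longer valid and should not be passed to `inotify_rm_watch()`, since inotify may eventually hand the same number to a new watch.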
> > > >
> > > > This looks very useful,
> > > > But,
> > > > How will the application know that it can rely on IN_DELETE_SELF
> > > > from cgroups if this is not an opt-in feature?
> > > >
> > > > Essentially, this is similar to the discussions on adding "remote"
> > > > fs notification support (e.g. for smb) and in those discussions
> > > > I insist that "remote" notification should be opt-in (which is
> > > > easy to do with an fanotify init flag) and I claim that mixing
> > > > "remote" events with "local" events on the same group is undesired.
> > >
> > > I think this situation is a bit different because this isn't adding
> > > new features to fsnotify. This is filling a gap that you'd expect to
> > > work if you only read the cgroups or inotify documentation without
> > > realizing that kernfs is simply wired up differently for notification
> > > support than most other filesystems, and only partially supports the
> > > existing notification events. It's opt-in in the sense that an
> > > application registers for IN_DELETE_SELF, but other than a runtime
> > > test like what I added in the selftests I'm not sure if there's a good
> > > way to detect whether the kernel will actually send the event.
> > > Practically speaking though, if merged upstream I will backport these
> > > patches to all the kernels we use so a runtime check shouldn't be
> > > necessary for our applications.
> > >
> >
> > That's beside the point.
> > An application does not know if it is running on a kernel with the
> > backported patch or not, so an application needs to either rely on
> > getting the event or it has to poll. How will the application know if
> > it needs to poll or not?
>
> Either by testing for the behavior at runtime like I mentioned, or by
> depending on certification testing for the platform the application is
> running on which would verify that the selftests I added pass.
> We do the former to check for the presence of other features like
> swappiness support with memory.reclaim, and also the latter for all
> devices.
>
> > > > However, IN_IGNORED is created when an inotify watch is removed
> > > > and IN_DELETE_SELF is called when a vfs inode is destroyed.
> > > > When setting an inotify watch for IN_IGNORED|IN_DELETE_SELF there
> > > > has to be a vfs inode with inotify mark attached, so why are those
> > > > events not created already? What am I missing?
> > >
> > > The difference is vfs isn't involved when kernfs files are unlinked.
> >
> > No, but the vfs is involved when the last reference on the kernfs inode
> > is dropped.
> >
> > > When a cgroup removal occurs, we get to kernfs_remove via kernfs'
> > > inode_operations without calling vfs_unlink. (You can't rm cgroup
> > > files directly.)
> > >
> >
> > Yes and if there was a vfs inode for this kernfs object, the vfs inode
> > needs to be dropped.
>
> It should be, but it isn't right now.
>
> > > > Are you expecting to get IN_IGNORED|IN_DELETE_SELF on an entry
> > > > while watching the parent? Because this is not how the API works.
> > >
> > > No, only on the file being watched. The parent should only get
> > > IN_DELETE, but I read your feedback below and I'm fine with removing
> > > that part and just sending the IN_DELETE_SELF and IN_IGNORED events.
> > >
> >
> > So if the file was being watched, some application needed to call
> > inotify_add_watch() with the user path to the cgroupfs inode
> > and the inotify watch keeps a live reference to this vfs inode.
> >
> > When the cgroup is being destroyed something needs to drop
> > this vfs inode and call __destroy_inode() -> fsnotify_inode_delete()
> > which should remove the inotify watch and result in IN_IGNORED.
>
> Nothing like this exists before this patch.
>
> > IN_DELETE_SELF is a different story, because the inode does not
> > have zero i_nlink.
> >
> > I did not try to follow the code path of cgroupfs destroy when an
> > inotify watch on a cgroup file exists, but this is what I expect.
> > Please explain - what am I missing?
>
> Yes that's the problem here. The inode isn't dropped unless the watch
> is removed, and the watch isn't removed because kernfs doesn't go
> through vfs to notify about file removal. There is nothing to trigger
> dropping the watch and the associated inode reference except this
> patch calling into fsnotify_inoderemove which both sends
> IN_DELETE_SELF and calls __fsnotify_inode_delete for the IN_IGNORED
> and inode cleanup.
>
> Without this, the watch and inode persist after file deletion until
> the process exits and file descriptors are cleaned up, or until
> inotify_rm_watch gets called manually.
>
Yeah, that's not good.
Will be happy to see that fixed.

Thanks,
Amir.
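The runtime test T.J. mentions could take a form like the sketch below. It assumes cgroup v2 (where every non-root cgroup has a cgroup.events file), a parent cgroup directory the process is allowed to create children in, and an arbitrary child name `inotify-probe`. `probe_kernfs_delete_self()` is a hypothetical helper, not the selftest added by the patch. It creates an empty child cgroup, watches its cgroup.events for IN_DELETE_SELF, removes the child with `rmdir()`, and then waits up to a timeout for the event.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <poll.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical runtime probe: does this kernel deliver IN_DELETE_SELF
 * for kernfs files? parent_dir is assumed to be a writable cgroup v2
 * directory. Returns 1 if IN_DELETE_SELF arrived, 0 if no event arrived
 * before the timeout, -1 on setup error.
 */
int probe_kernfs_delete_self(const char *parent_dir, int timeout_ms)
{
	char dir[PATH_MAX], file[PATH_MAX];
	/* Buffer alignment as in the inotify(7) example program. */
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	struct pollfd pfd;
	int ifd, ret = -1;

	/* "inotify-probe" is an arbitrary child name chosen for this sketch. */
	if (snprintf(dir, sizeof(dir), "%s/inotify-probe", parent_dir) >= (int)sizeof(dir) ||
	    snprintf(file, sizeof(file), "%s/cgroup.events", dir) >= (int)sizeof(file))
		return -1;

	ifd = inotify_init1(IN_CLOEXEC | IN_NONBLOCK);
	if (ifd < 0)
		return -1;
	if (mkdir(dir, 0755) < 0)
		goto out;
	if (inotify_add_watch(ifd, file, IN_DELETE_SELF) < 0) {
		rmdir(dir); /* e.g. not cgroup v2: undo the mkdir */
		goto out;
	}
	/* The new cgroup is empty, so it can be removed right away. */
	if (rmdir(dir) < 0)
		goto out;

	ret = 0; /* from here on, "no event" means no kernel support */
	pfd.fd = ifd;
	pfd.events = POLLIN;
	while (poll(&pfd, 1, timeout_ms) > 0) {
		ssize_t n = read(ifd, buf, sizeof(buf));

		if (n <= 0)
			break;
		for (ssize_t off = 0; off < n;) {
			const struct inotify_event *ev =
				(const struct inotify_event *)(buf + off);

			if (ev->mask & IN_DELETE_SELF) {
				ret = 1;
				goto out;
			}
			off += (ssize_t)(sizeof(*ev) + ev->len);
		}
	}
out:
	close(ifd);
	return ret;
}
```

An application could run such a probe once at startup against a cgroup directory it owns, and fall back to polling or a parent-directory IN_DELETE watch when it returns 0 or -1. A result of 0 only means that no event arrived within the timeout, so the timeout should leave some headroom for a loaded system.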

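On kernels without this change, an explicit inotify_rm_watch() is the only way to drop the stale watch and its inode reference before the process exits. A minimal sketch of such a release step follows. It assumes the caller tracks whether IN_IGNORED was already read for the watch. `release_watch()` and its `ignored_seen` parameter are hypothetical names.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/inotify.h>

/*
 * Hypothetical helper: release a watch on a kernfs file that is known
 * to be gone. Returns 0 if the watch is released (now or earlier),
 * -1 on other errors (errno set by inotify_rm_watch()).
 */
int release_watch(int ifd, int wd, bool ignored_seen)
{
	/* The kernel already removed the watch; wd may since be reused. */
	if (ignored_seen)
		return 0;
	if (inotify_rm_watch(ifd, wd) == 0)
		return 0;
	/* EINVAL: wd is not (or no longer) a valid watch on ifd. */
	return errno == EINVAL ? 0 : -1;
}
```

Skipping the call once IN_IGNORED has been read matters on a patched kernel: the watch is already gone, and calling inotify_rm_watch() on a descriptor number that has since been reused would remove an unrelated watch. An explicit removal still queues one IN_IGNORED event, which the reader should expect and discard.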
