On Thu, Feb 12, 2026 at 01:58:13PM -0800, T.J. Mercier wrote:
> Currently some kernfs files (e.g. cgroup.events, memory.events) support
> inotify watches for IN_MODIFY, but unlike with regular filesystems, they
> do not receive IN_DELETE_SELF or IN_IGNORED events when they are
> removed.
> 
> This creates a problem for processes monitoring cgroups. For example, a
> service monitoring memory.events for memory.high breaches needs to know
> when a cgroup is removed to clean up its state. Where it's known that a
> cgroup is removed when all processes die, without IN_DELETE_SELF the
> service must resort to inefficient workarounds such as:
> 1.  Periodically scanning procfs to detect process death (wastes CPU and
>     is susceptible to PID reuse).
> 2.  Placing an additional IN_DELETE watch on the parent directory
>     (wastes resources managing double the watches).
> 3.  Holding a pidfd for every monitored cgroup (can exhaust file
>     descriptors).
> 
> This patch enables kernfs to send IN_DELETE_SELF and IN_IGNORED events.
> This allows applications to rely on a single existing watch on the file
> of interest (e.g. memory.events) to receive notifications for both
> modifications and the eventual removal of the file, as well as automatic
> watch descriptor cleanup, simplifying userspace logic and improving
> resource efficiency.

This looks very useful, but how will the application know that it can
rely on IN_DELETE_SELF from cgroups if this is not an opt-in feature?

Essentially, this is similar to the discussions on adding "remote"
fs notification support (e.g. for smb). In those discussions I have
insisted that "remote" notification should be opt-in (which is easy
to do with an fanotify init flag), and I have argued that mixing
"remote" events with "local" events on the same group is undesirable.

However, IN_IGNORED is generated when an inotify watch is removed
and IN_DELETE_SELF is generated when a vfs inode is destroyed.
When setting an inotify watch for IN_IGNORED|IN_DELETE_SELF there
has to be a vfs inode with an inotify mark attached, so why are those
events not generated already? What am I missing?

Are you expecting to get IN_IGNORED|IN_DELETE_SELF on an entry
while watching the parent? Because this is not how the API works.
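
To make the expected semantics concrete, here is a minimal userspace
sketch of what happens on a regular filesystem (not kernfs) when the
watch is on the file itself. The helper names watch_unlink_collect()
and demo_on_tmpfile() and the /tmp path are made up for illustration;
the inotify calls are the standard ones from inotify(7).

```c
#include <poll.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper (name invented for this sketch): watch @path itself,
 * unlink it, and return the OR of all event masks seen on that watch.
 * Returns 0 if the watch could not be set up.
 */
static uint32_t watch_unlink_collect(const char *path)
{
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	struct pollfd pfd;
	uint32_t seen = 0;
	int fd, wd;

	fd = inotify_init1(IN_CLOEXEC);
	if (fd < 0)
		return 0;

	/* The watch is on the file itself, not on its parent directory. */
	wd = inotify_add_watch(fd, path, IN_MODIFY | IN_DELETE_SELF);
	if (wd < 0 || unlink(path) < 0) {
		close(fd);
		return 0;
	}

	pfd.fd = fd;
	pfd.events = POLLIN;

	/* IN_IGNORED is the last event for a watch; give up after 1s of silence. */
	while (!(seen & IN_IGNORED) && poll(&pfd, 1, 1000) > 0) {
		ssize_t len = read(fd, buf, sizeof(buf));
		char *p;

		if (len <= 0)
			break;
		for (p = buf; p < buf + len; ) {
			const struct inotify_event *ev = (const void *)p;

			if (ev->wd == wd)
				seen |= ev->mask;
			p += sizeof(*ev) + ev->len;
		}
	}

	close(fd);
	return seen;
}

/* Usage sketch: run the helper against a throwaway file (assumed path). */
static uint32_t demo_on_tmpfile(void)
{
	char path[] = "/tmp/inotify-self-XXXXXX";
	int tfd = mkstemp(path);

	if (tfd < 0)
		return 0;
	/* Close first: an open fd would delay destruction of the inode. */
	close(tfd);
	return watch_unlink_collect(path);
}
```

On a regular filesystem I would expect the returned mask to contain
both IN_DELETE_SELF and IN_IGNORED, even though IN_IGNORED was never
requested in the watch mask.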

I think it should be possible to set a super block fanotify watch
on cgroupfs and get all the FAN_DELETE_SELF events, but maybe we
do not allow this right now; I did not check. I just wanted to give
you another direction to follow.

> 
> Implementation details:
> The kernfs notification worker is updated to handle file deletion.
> fsnotify handles sending MODIFY events to both a watched file and its
> parent, but it does not handle sending a DELETE event to the parent and
> a DELETE_SELF event to the watched file in a single call. Therefore,
> separate fsnotify calls are made: one for the parent (DELETE) and one
> for the child (DELETE_SELF), while retaining the optimized single call

IN_DELETE_SELF and IN_IGNORED are special and I don't really mind adding
them to kernfs, seeing that they are very useful, but adding IN_DELETE
without adding IN_CREATE is rather arbitrary and I don't like it as
much.

> for MODIFY events.
> 
> Signed-off-by: T.J. Mercier <[email protected]>
> ---
>  fs/kernfs/dir.c             | 21 +++++++++++++++++++++
>  fs/kernfs/file.c            | 16 ++++++++++------
>  fs/kernfs/kernfs-internal.h |  3 +++
>  3 files changed, 34 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 29baeeb97871..e5bda829fcb8 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -9,6 +9,7 @@
>  
>  #include <linux/sched.h>
>  #include <linux/fs.h>
> +#include <linux/fsnotify_backend.h>
>  #include <linux/namei.h>
>  #include <linux/idr.h>
>  #include <linux/slab.h>
> @@ -1471,6 +1472,23 @@ void kernfs_show(struct kernfs_node *kn, bool show)
>       up_write(&root->kernfs_rwsem);
>  }
>  
> +static void kernfs_notify_file_deleted(struct kernfs_node *kn)
> +{
> +     static DECLARE_WORK(kernfs_notify_deleted_work,
> +                         kernfs_notify_workfn);
> +
> +     guard(spinlock_irqsave)(&kernfs_notify_lock);
> +     /* may overwite already pending FS_MODIFY events */
> +     kn->attr.notify_event = FS_DELETE;
> +
> +     if (!kn->attr.notify_next) {
> +             kernfs_get(kn);
> +             kn->attr.notify_next = kernfs_notify_list;
> +             kernfs_notify_list = kn;
> +             schedule_work(&kernfs_notify_deleted_work);
> +     }
> +}
> +
>  static void __kernfs_remove(struct kernfs_node *kn)
>  {
>       struct kernfs_node *pos, *parent;
> @@ -1520,6 +1538,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
>                       struct kernfs_iattrs *ps_iattr =
>                               parent ? parent->iattr : NULL;
>  
> +                     if (kernfs_type(pos) == KERNFS_FILE)
> +                             kernfs_notify_file_deleted(pos);
> +
>                       /* update timestamps on the parent */
>                       down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>  
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index e978284ff983..2d21af3cfcad 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -37,8 +37,8 @@ struct kernfs_open_node {
>   */
>  #define KERNFS_NOTIFY_EOL                    ((void *)&kernfs_notify_list)
>  
> -static DEFINE_SPINLOCK(kernfs_notify_lock);
> -static struct kernfs_node *kernfs_notify_list = KERNFS_NOTIFY_EOL;
> +DEFINE_SPINLOCK(kernfs_notify_lock);
> +struct kernfs_node *kernfs_notify_list = KERNFS_NOTIFY_EOL;
>  
>  static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
>  {
> @@ -909,7 +909,7 @@ static loff_t kernfs_fop_llseek(struct file *file, loff_t offset, int whence)
>       return ret;
>  }
>  
> -static void kernfs_notify_workfn(struct work_struct *work)
> +void kernfs_notify_workfn(struct work_struct *work)
>  {
>       struct kernfs_node *kn;
>       struct kernfs_super_info *info;
> @@ -959,15 +959,19 @@ static void kernfs_notify_workfn(struct work_struct *work)
>                       if (p_inode) {
>                               fsnotify(notify_event | FS_EVENT_ON_CHILD,
>                                        inode, FSNOTIFY_EVENT_INODE,
> -                                      p_inode, &name, inode, 0);
> +                                      p_inode, &name,
> +                                      (notify_event == FS_MODIFY) ?
> +                                             inode : NULL, 0);
>                               iput(p_inode);
>                       }
>  
>                       kernfs_put(parent);
>               }
>  
> -             if (!p_inode)
> -                     fsnotify_inode(inode, notify_event);
> +             if (notify_event == FS_DELETE)
> +                     fsnotify_inoderemove(inode);
> +             else if (!p_inode)
> +                     fsnotify_inode(inode, FS_MODIFY);

Didn't you mean notify_event here, rather than a hard-coded FS_MODIFY?

Thanks,
Amir.
