https://jira.sw.ru/browse/PSBM-81411

On 14.02.2018 14:34, Kirill Tkhai wrote:
> [This patch has been accepted in mainstream and is heading to the
>  main tree. Since 3.10 has no idr_set_cursor(), the port sets
>  data->idr.cur directly].
> 
> A watch descriptor is the id of a watch created by inotify_add_watch().
> It is allocated in inotify_add_to_idr(), and numbering starts
> from 1. Every new inotify watch obtains the next available
> number (usually old + 1), as served by idr_alloc_cyclic().
> 
> The CRIU (Checkpoint/Restore In Userspace) project supports inotify
> files and restores watch descriptors with the same numbers
> they had before dump. Since there was no kernel support, we
> had to use a loop to add a watch with a specific descriptor id:
> 
>       while (1) {
>               int wd;
> 
>               wd = inotify_add_watch(inotify_fd, path, mask);
>               if (wd < 0) {
>                       break;
>               } else if (wd == desired_wd_id) {
>                       ret = 0;
>                       break;
>               }
> 
>               inotify_rm_watch(inotify_fd, wd);
>       }
> 
> (You may find the actual code at the link below:
> https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)
> 
> The loop is suboptimal and very expensive, but since there was no better
> kernel support, it was the only way to restore that. Fortunately, we had
> mostly met descriptors with small ids, and this approach had worked somehow.
> 
> But recently containers with inotify watches with big watch descriptors
> have begun to appear, and this approach stopped working at all. When the
> descriptor id is something like 0x34d71d6, the restoring process spins in
> a busy loop for a long time, the restore hangs, and the delay of migration
> from node to node is easily noticeable.
> 
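To put a number on the cost of the loop described above, here is a toy
model of the cursor rule only. It is a sketch, not the kernel idr: the
toy_* names are hypothetical, and it assumes no other watches exist and
ignores wrap-around at INT_MAX.

```c
#include <assert.h>

/* Toy cyclic allocator: only the cursor, no real id storage. */
struct toy_idr {
	int next;	/* id handed out by the next allocation */
};

/* Hand out the id under the cursor and advance the cursor by one. */
static int toy_alloc_cyclic(struct toy_idr *idr)
{
	return idr->next++;
}

/* Move the cursor, which is what the proposed ioctl asks the kernel to do. */
static void toy_set_cursor(struct toy_idr *idr, int next)
{
	idr->next = next;
}

/*
 * Model of the add/remove retry loop: count allocations until `wanted`
 * is returned. Returns -1 if the cursor has already passed `wanted`.
 */
static long toy_restore_by_retry(struct toy_idr *idr, int wanted)
{
	long attempts = 0;

	assert(wanted >= 1);
	for (;;) {
		int wd = toy_alloc_cyclic(idr);

		attempts++;
		if (wd == wanted)
			return attempts;
		if (wd > wanted)
			return -1;
		/* the real loop calls inotify_rm_watch() here */
	}
}
```

In this model, starting from cursor 1, reaching 0x34d71d6 takes
0x34d71d6 rounds (about 55 million), and in the real loop every round
costs two syscalls. After toy_set_cursor(&idr, 0x34d71d6) the same call
returns after a single round.
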
> This patch aims to solve this problem. It introduces a new ioctl,
> INOTIFY_IOC_SETNEXTWD, which allows userspace to request the number of the
> next created watch descriptor. It simply calls the idr_set_cursor() primitive
> to populate idr::idr_next, so that the next idr_alloc_cyclic() allocation
> will return this id if it is not occupied. This is the same approach that
> is used to restore some other resources from userspace. For example,
> /proc/sys/kernel/ns_last_pid works the same way for task pids.
> 
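For reference, a minimal userspace sketch of how a restorer might use
the ioctl described above. restore_watch() is a hypothetical helper, not
CRIU code. It assumes the value is passed directly as the ioctl argument
(as the handler in this patch reads it), and it falls back to the old
retry loop when the kernel does not know the ioctl.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <sys/ioctl.h>

/* Same number as in the patch; libc's <sys/inotify.h> may not define it. */
#ifndef INOTIFY_IOC_SETNEXTWD
#define INOTIFY_IOC_SETNEXTWD	_IOW('I', 0, int32_t)
#endif

/*
 * Hypothetical helper: add a watch on `path` whose descriptor is
 * `wanted_wd`. Returns 0 on success, -1 on failure.
 */
static int restore_watch(int inotify_fd, const char *path, uint32_t mask,
			 int wanted_wd)
{
	int wd;

	assert(wanted_wd >= 1);

	/* Fast path: move the cursor, then a single add should hit the id. */
	if (ioctl(inotify_fd, INOTIFY_IOC_SETNEXTWD,
		  (unsigned long)wanted_wd) == 0) {
		wd = inotify_add_watch(inotify_fd, path, mask);
		if (wd == wanted_wd)
			return 0;
		if (wd >= 0)	/* id was occupied, undo the stray watch */
			inotify_rm_watch(inotify_fd, wd);
		return -1;
	}

	/* Fallback for kernels without the ioctl: the old retry loop. */
	for (;;) {
		wd = inotify_add_watch(inotify_fd, path, mask);
		if (wd < 0)
			return -1;
		if (wd == wanted_wd)
			return 0;
		inotify_rm_watch(inotify_fd, wd);
		if (wd > wanted_wd)	/* cursor is already past the id */
			return -1;
	}
}
```

A caller would create the descriptor with inotify_init1() and then call
restore_watch() once per dumped watch. Restoring in ascending wd order
is an assumption of the fallback path only.
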
> The new code is under the CONFIG_CHECKPOINT_RESTORE #define, so small
> systems may exclude it.
> 
> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
> Reviewed-by: Cyrill Gorcunov <gorcu...@openvz.org>
> Reviewed-by: Andrew Morton <a...@linux-foundation.org>
> ---
>  fs/notify/inotify/inotify_user.c |   14 ++++++++++++++
>  include/uapi/linux/inotify.h     |    8 ++++++++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index cc39a5d84a0e..fe93507e4d98 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -309,6 +309,20 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
>               spin_unlock(&group->notification_lock);
>               ret = put_user(send_len, (int __user *) p);
>               break;
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +     case INOTIFY_IOC_SETNEXTWD:
> +             ret = -EINVAL;
> +             if (arg >= 1 && arg <= INT_MAX) {
> +                     struct inotify_group_private_data *data;
> +
> +                     data = &group->inotify_data;
> +                     spin_lock(&data->idr_lock);
> +                     data->idr.cur = (unsigned int)arg;
> +                     spin_unlock(&data->idr_lock);
> +                     ret = 0;
> +             }
> +             break;
> +#endif /* CONFIG_CHECKPOINT_RESTORE */
>       }
>  
>       return ret;
> diff --git a/include/uapi/linux/inotify.h b/include/uapi/linux/inotify.h
> index e6bf35b2dd34..ce8ac99480fa 100644
> --- a/include/uapi/linux/inotify.h
> +++ b/include/uapi/linux/inotify.h
> @@ -70,5 +70,13 @@ struct inotify_event {
>  #define IN_CLOEXEC O_CLOEXEC
>  #define IN_NONBLOCK O_NONBLOCK
>  
> +/*
> + * ioctl numbers: inotify uses 'I' prefix for all ioctls,
> + * except historical FIONREAD, which is based on 'T'.
> + *
> + * INOTIFY_IOC_SETNEXTWD: set desired number of next created
> + * watch descriptor.
> + */
> +#define INOTIFY_IOC_SETNEXTWD        _IOW('I', 0, __s32)
>  
>  #endif /* _UAPI_LINUX_INOTIFY_H */
> 
_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
