https://jira.sw.ru/browse/PSBM-81411
On 14.02.2018 14:34, Kirill Tkhai wrote: > [This patch is accepted in ms and is going to main tree. > As there is no idr_set_cursor() in 3.10 I've directly > used data->idr.cur on port]. > > Watch descriptor is id of the watch created by inotify_add_watch(). > It is allocated in inotify_add_to_idr(), and takes the numbers > starting from 1. Every new inotify watch obtains next available > number (usually, old + 1), as served by idr_alloc_cyclic(). > > CRIU (Checkpoint/Restore In Userspace) project supports inotify > files, and restores watched descriptors with the same numbers, > they had before dump. Since there was no kernel support, we > had to use cycle to add a watch with specific descriptor id: > > while (1) { > int wd; > > wd = inotify_add_watch(inotify_fd, path, mask); > if (wd < 0) { > break; > } else if (wd == desired_wd_id) { > ret = 0; > break; > } > > inotify_rm_watch(inotify_fd, wd); > } > > (You may find the actual code at the below link: > https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577) > > The cycle is suboptiomal and very expensive, but since there is no better > kernel support, it was the only way to restore that. Happily, we had met > mostly descriptors with small id, and this approach had worked somehow. > > But recent time containers with inotify with big watch descriptors > begun to come, and this way stopped to work at all. When descriptor id > is something about 0x34d71d6, the restoring process spins in busy loop > for a long time, and the restore hungs and delay of migration from node > to node could easily be watched. > > This patch aims to solve this problem. It introduces new ioctl > INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created > watch descriptor from userspace. It simply calls idr_set_cursor() primitive > to populate idr::idr_next, so that next idr_alloc_cyclic() allocation > will return this id, if it is not occupied. This is the way which is > used to restore some other resources from userspace. For example, > /proc/sys/kernel/ns_last_pid works the same for task pids. > > The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system > may exclude it. > > Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com> > Reviewed-by: Cyrill Gorcunov <gorcu...@openvz.org> > Reviewed-by: Andrew Morton <a...@linux-foundation.org> > --- > fs/notify/inotify/inotify_user.c | 14 ++++++++++++++ > include/uapi/linux/inotify.h | 8 ++++++++ > 2 files changed, 22 insertions(+) > > diff --git a/fs/notify/inotify/inotify_user.c > b/fs/notify/inotify/inotify_user.c > index cc39a5d84a0e..fe93507e4d98 100644 > --- a/fs/notify/inotify/inotify_user.c > +++ b/fs/notify/inotify/inotify_user.c > @@ -309,6 +309,20 @@ static long inotify_ioctl(struct file *file, unsigned > int cmd, > spin_unlock(&group->notification_lock); > ret = put_user(send_len, (int __user *) p); > break; > +#ifdef CONFIG_CHECKPOINT_RESTORE > + case INOTIFY_IOC_SETNEXTWD: > + ret = -EINVAL; > + if (arg >= 1 && arg <= INT_MAX) { > + struct inotify_group_private_data *data; > + > + data = &group->inotify_data; > + spin_lock(&data->idr_lock); > + data->idr.cur = (unsigned int)arg; > + spin_unlock(&data->idr_lock); > + ret = 0; > + } > + break; > +#endif /* CONFIG_CHECKPOINT_RESTORE */ > } > > return ret; > diff --git a/include/uapi/linux/inotify.h b/include/uapi/linux/inotify.h > index e6bf35b2dd34..ce8ac99480fa 100644 > --- a/include/uapi/linux/inotify.h > +++ b/include/uapi/linux/inotify.h > @@ -70,5 +70,13 @@ struct inotify_event { > #define IN_CLOEXEC O_CLOEXEC > #define IN_NONBLOCK O_NONBLOCK > > +/* > + * ioctl numbers: inotify uses 'I' prefix for all ioctls, > + * except historical FIONREAD, which is based on 'T'. > + * > + * INOTIFY_IOC_SETNEXTWD: set desired number of next created > + * watch descriptor. > + */ > +#define INOTIFY_IOC_SETNEXTWD _IOW('I', 0, __s32) > > #endif /* _UAPI_LINUX_INOTIFY_H */ > _______________________________________________ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel