On Mon, Oct 9, 2023 at 9:29 AM Lokesh Gidra <[email protected]> wrote: > > On Mon, Oct 9, 2023 at 5:24 PM David Hildenbrand <[email protected]> wrote: > > > > On 09.10.23 18:21, Suren Baghdasaryan wrote: > > > On Mon, Oct 9, 2023 at 7:38 AM David Hildenbrand <[email protected]> wrote: > > >> > > >> On 09.10.23 08:42, Suren Baghdasaryan wrote: > > >>> From: Andrea Arcangeli <[email protected]> > > >>> > > >>> Implement the uABI of UFFDIO_MOVE ioctl. > > >>> UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application > > >>> needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are > > >>> available (in userspace) for recycling, as is usually the case in heap > > >>> compaction algorithms, then we can avoid the page allocation and memcpy > > >>> (done by UFFDIO_COPY). Also, since the pages are recycled in the > > >>> userspace, we avoid the need to release (via madvise) the pages back to > > >>> the kernel [2]. > > >>> We see over 40% reduction (on a Google pixel 6 device) in the compacting > > >>> thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was > > >>> measured using a benchmark that emulates a heap compaction > > >>> implementation > > >>> using userfaultfd (to allow concurrent accesses by application threads). > > >>> More details of the usecase are explained in [2]. > > >>> Furthermore, UFFDIO_MOVE enables moving swapped-out pages without > > >>> touching them within the same vma. Today, it can only be done by mremap, > > >>> however it forces splitting the vma. > > >>> > > >>> [1] > > >>> https://lore.kernel.org/all/[email protected]/ > > >>> [2] > > >>> https://lore.kernel.org/linux-mm/ca+eeso4uo84ssnbharh4hvlnhauq5nzknkxqxrcyjninvjp...@mail.gmail.com/ > > >>> > > >>> Update for the ioctl_userfaultfd(2) manpage: > > >>> > > >>> UFFDIO_MOVE > > >>> (Since Linux xxx) Move a continuous memory chunk into the > > >>> userfault registered range and optionally wake up the blocked > > >>> thread. The source and destination addresses and the number of > > >>> bytes to move are specified by the src, dst, and len fields of > > >>> the uffdio_move structure pointed to by argp: > > >>> > > >>> struct uffdio_move { > > >>> __u64 dst; /* Destination of move */ > > >>> __u64 src; /* Source of move */ > > >>> __u64 len; /* Number of bytes to move */ > > >>> __u64 mode; /* Flags controlling behavior of move */ > > >>> __s64 move; /* Number of bytes moved, or negated > > >>> error */ > > >>> }; > > >>> > > >>> The following value may be bitwise ORed in mode to change the > > >>> behavior of the UFFDIO_MOVE operation: > > >>> > > >>> UFFDIO_MOVE_MODE_DONTWAKE > > >>> Do not wake up the thread that waits for page-fault > > >>> resolution > > >>> > > >>> UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES > > >>> Allow holes in the source virtual range that is being > > >>> moved. > > >>> When not specified, the holes will result in ENOENT > > >>> error. > > >>> When specified, the holes will be accounted as > > >>> successfully > > >>> moved memory. This is mostly useful to move hugepage > > >>> aligned > > >>> virtual regions without knowing if there are transparent > > >>> hugepages in the regions or not, but preventing the > > >>> risk of > > >>> having to split the hugepage during the operation. > > >>> > > >>> The move field is used by the kernel to return the number of > > >>> bytes that was actually moved, or an error (a negated errno- > > >>> style value). If the value returned in move doesn't match the > > >>> value that was specified in len, the operation fails with the > > >>> error EAGAIN. The move field is output-only; it is not read by > > >>> the UFFDIO_MOVE operation. > > >>> > > >>> The operation may fail for various reasons. Usually, remapping > > >>> of > > >>> pages that are not exclusive to the given process fail; once > > >>> KSM > > >>> might deduplicate pages or fork() COW-shares pages during > > >>> fork() > > >>> with child processes, they are no longer exclusive. Further, > > >>> the > > >>> kernel might only perform lightweight checks for detecting > > >>> whether > > >>> the pages are exclusive, and return -EBUSY in case that check > > >>> fails. > > >>> To make the operation more likely to succeed, KSM should be > > >>> disabled, fork() should be avoided or MADV_DONTFORK should be > > >>> configured for the source VMA before fork(). > > >>> > > >>> This ioctl(2) operation returns 0 on success. In this case, > > >>> the > > >>> entire area was moved. On error, -1 is returned and errno is > > >>> set to indicate the error. Possible errors include: > > >>> > > >>> EAGAIN The number of bytes moved (i.e., the value returned in > > >>> the move field) does not equal the value that was > > >>> specified in the len field. > > >>> > > >>> EINVAL Either dst or len was not a multiple of the system page > > >>> size, or the range specified by src and len or dst and > > >>> len > > >>> was invalid. > > >>> > > >>> EINVAL An invalid bit was specified in the mode field. > > >>> > > >>> ENOENT > > >>> The source virtual memory range has unmapped holes and > > >>> UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set. > > >>> > > >>> EEXIST > > >>> The destination virtual memory range is fully or > > >>> partially > > >>> mapped. > > >>> > > >>> EBUSY > > >>> The pages in the source virtual memory range are not > > >>> exclusive to the process. The kernel might only perform > > >>> lightweight checks for detecting whether the pages are > > >>> exclusive. To make the operation more likely to succeed, > > >>> KSM should be disabled, fork() should be avoided or > > >>> MADV_DONTFORK should be configured for the source > > >>> virtual > > >>> memory area before fork(). > > >>> > > >>> ENOMEM Allocating memory needed for the operation failed. > > >>> > > >>> ESRCH > > >>> The faulting process has exited at the time of a > > >>> UFFDIO_MOVE operation. > > >>> > > >> > > >> A general comment simply because I realized that just now: does anything > > >> speak against limiting the operations now to a single MM? > > >> > > >> The use cases I heard so far don't need it. If ever required, we could > > >> consider extending it. > > >> > > >> Let's reduce complexity and KIS unless really required. > > > > > > Let me check if there are use cases that require moves between MMs. > > > Andrea seems to have put considerable effort to make it work between > > > MMs and it would be a pity to lose that. I can send a follow-up patch > > > to recover that functionality and even if it does not get merged, it > > > can be used in the future as a reference. But first let me check if we > > > can drop it. > > For the compaction use case that we have it's fine to limit it to > single MM. However, for general use I think Peter will have a better > idea. > > > > Yes, that sounds reasonable. Unless the big important use cases requires > > moving pages between processes, let's leave that as future work for now. > > > > -- > > Cheers, > > > > David / dhildenb > >
While going through mremap's move_page_tables code, which is pretty similar to what we do here, I noticed that cache is flushed as well, whereas we are not doing that here. Is that OK? I'm not a MM expert by any means, so it's a question rather than a comment :)
