On 24.02.20 23:49, Peter Xu wrote: > On Fri, Feb 21, 2020 at 05:42:04PM +0100, David Hildenbrand wrote: >> When we partially change mappings (esp., mmap over parts of an existing >> mmap like qemu_ram_remap() does) where we have a userfaultfd handler >> registered, the handler will implicitly be unregistered from the parts that >> changed. >> >> Trying to place pages onto mappings where there is no longer a handler >> registered will fail. Let's make sure that any waiter is woken up - we >> have to do that manually. >> >> Let's also document how UFFDIO_UNREGISTER will handle this scenario. >> >> This is mainly a preparation for RAM blocks with resizable allcoations, >> where the mapping of the invalid RAM range will change. The source will >> keep sending pages that are outside of the new (shrunk) RAM size. We have >> to treat these pages like they would have been migrated, but can >> essentially simply drop the content (ignore the placement error). >> >> Keep printing a warning on EINVAL, to avoid hiding other (programming) >> issues. ENOENT is unique. >> >> Cc: "Dr. David Alan Gilbert" <dgilb...@redhat.com> >> Cc: Juan Quintela <quint...@redhat.com> >> Cc: Peter Xu <pet...@redhat.com> >> Cc: Andrea Arcangeli <aarca...@redhat.com> >> Signed-off-by: David Hildenbrand <da...@redhat.com> >> --- >> migration/postcopy-ram.c | 37 +++++++++++++++++++++++++++++++++++++ >> 1 file changed, 37 insertions(+) >> >> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c >> index c68caf4e42..f023830b9a 100644 >> --- a/migration/postcopy-ram.c >> +++ b/migration/postcopy-ram.c >> @@ -506,6 +506,12 @@ static int cleanup_range(RAMBlock *rb, void *opaque) >> range_struct.start = (uintptr_t)host_addr; >> range_struct.len = length; >> >> + /* >> + * In case the mapping was partially changed since we enabled userfault >> + * (e.g., via qemu_ram_remap()), the userfaultfd handler was already >> removed >> + * for the mappings that changed. Unregistering will, however, still >> work >> + * and ignore mappings without a registered handler. >> + */ > > Ideally we should still only unregister what we have registered. > After all we do have this information because we know what we > registered, we know what has unmapped (in your new resize() hook, when > postcopy_state==RUNNING).
Not in the case of qemu_ram_remap(). And whatever you propose will require synchronization (see my other mail) and more complicated handling than this. uffd allows you to handle races with mmap changes in a very elegant way (e.g., -ENOENT, or unregisterignoring changed mappings). > > An extreme example is when we register with pages in range [A, B), > then shrink it to [A, C), then we mapped something else within [C, B) > (note, with virtio-mem logically B can be very big and C can be very > small, it means [B, C) can cover quite some address space). Then if: > > - [C, B) memory type is not compatible with uffd, or That will never happen in the near future. Without resizable allocations: - All memory is either anonymous or from a single fd In addition, right now, only anonymous memory can be used for resizable RAM. However, with resizable allocations we could have: - All used_length memory is either anonymous or from a single fd - All remaining memory is either anonymous or from a single fd Everything else does not make any sense IMHO and I don't think this is relevant long term. You cannot arbitrarily map things into the used_length part of a RAMBlock. That would contradict to its page_size and its fd. E.g., you would break qemu_ram_remap(). > > - [C, B) could be registered with uffd again due to some other > reason (so far QEMU should not have such a reason) Any code that wants to make use of uffd properly has to synchronize against postcopy code either way IMHO. It just doesn't work otherwise. E.g., once I would use it to protect unplugged memory in virtio-mem (something I am looking into right now and teaching QEMU not to touch all RAMBlock memory is complicated :) ), virtio-mem would unregister any uffd handler when notified that postcopy will start, and re-register after postcopy finished. [...] >> >> +static int qemu_ufd_wake_ioctl(int userfault_fd, void *host_addr, >> + uint64_t pagesize) >> +{ >> + struct uffdio_range range = { >> + .start = (uint64_t)(uintptr_t)host_addr, >> + .len = pagesize, >> + }; >> + >> + return ioctl(userfault_fd, UFFDIO_WAKE, &range); >> +} >> + >> static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr, >> void *from_addr, uint64_t pagesize, RAMBlock >> *rb) >> { >> @@ -1198,6 +1215,26 @@ static int qemu_ufd_copy_ioctl(int userfault_fd, void >> *host_addr, >> zero_struct.mode = 0; >> ret = ioctl(userfault_fd, UFFDIO_ZEROPAGE, &zero_struct); >> } >> + >> + /* >> + * When the mapping gets partially changed (e.g., qemu_ram_remap()) >> before >> + * we try to place a page, the userfaultfd handler will be removed for >> the >> + * changed mappings and placing pages will fail. We can safely ignore >> this, >> + * because mappings that changed on the destination don't need data >> from the >> + * source (e.g., qemu_ram_remap()). Wake up any waiter waiting for that >> page >> + * (unlikely but possible). Waking up waiters is always possible, even >> + * without a registered userfaultfd handler. >> + * >> + * Old kernels report EINVAL, new kernels report ENOENT in case there is >> + * no longer a userfaultfd handler for a mapping. >> + */ >> + if (ret && (errno == ENOENT || errno == EINVAL)) { >> + if (errno == EINVAL) { >> + warn_report("%s: Failed to place page %p. Waking up any >> waiters.", >> + __func__, host_addr); >> + } >> + ret = qemu_ufd_wake_ioctl(userfault_fd, host_addr, pagesize); > > ... if with above information (takes notes on where we registered > uffd), I think we don't need to capture error, but we can simply skip > those outliers. I think we could skip them in general. Nobody should be touching that memory. But debugging e.g., a SEGFAULT is easier than debugging some sleeping thread IMHO. -- Thanks, David / dhildenb