Re: Reducing vdpa migration downtime because of memory pin / maps

Eugenio Perez Martin Wed, 19 Jul 2023 03:41:16 -0700

On Mon, Jul 17, 2023 at 9:57 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
> Hey,
>
> I am now back from the break. Sorry for the delayed response, please see
> in line.
>
> On 7/9/2023 11:04 PM, Eugenio Perez Martin wrote:
> > On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>
> >>
> >> On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:
> >>> On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>
> >>>> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
> >>>>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei....@oracle.com> 
> >>>>> wrote:
> >>>>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei....@oracle.com> 
> >>>>>>> wrote:
> >>>>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> >>>>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei....@oracle.com> 
> >>>>>>>>> wrote:
> >>>>>>>>>> Sorry for reviving this old thread, I lost the best timing to 
> >>>>>>>>>> follow up
> >>>>>>>>>> on this while I was on vacation. I have been working on this and 
> >>>>>>>>>> found
> >>>>>>>>>> out some discrepancy, please see below.
> >>>>>>>>>>
> >>>>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>>>>>>>>>> Hi!
> >>>>>>>>>>>
> >>>>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one 
> >>>>>>>>>>> of
> >>>>>>>>>>> the factors that adds more downtime to migration is the handling 
> >>>>>>>>>>> of
> >>>>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
> >>>>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, 
> >>>>>>>>>>> the
> >>>>>>>>>>> destination device waits until all the guest memory / state is
> >>>>>>>>>>> migrated to start pinning all the memory.
> >>>>>>>>>>>
> >>>>>>>>>>> The proposal is to bind it to the char device life cycle (open vs
> >>>>>>>>>>> close),
> >>>>>>>>>> Hmmm, really? If it's the life cycle for char device, the next 
> >>>>>>>>>> guest /
> >>>>>>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
> >>>>>>>>>>
> >>>>>>>>> Maybe my sentence was not accurate, but I think we're on the same 
> >>>>>>>>> page here.
> >>>>>>>>>
> >>>>>>>>> Two qemu instances opening the same char device at the same time are
> >>>>>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
> >>>>>>>>> qemu that opens the char device should see a clean device anyway.
> >>>>>>>> I mean the pin can't be done at the time of char device open, where 
> >>>>>>>> the
> >>>>>>>> user address space is not known/bound yet. The earliest point 
> >>>>>>>> possible
> >>>>>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER 
> >>>>>>>> is
> >>>>>>>> done.
> >>>>>>> Maybe we are deviating, let me start again.
> >>>>>>>
> >>>>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
> >>>>>>> .listener member of struct vhost_vdpa.
> >>>>>>>
> >>>>>>> At this moment, the memory listener is registered at
> >>>>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
> >>>>>>> and is unregistered in both vhost_vdpa_reset_status and
> >>>>>>> vhost_vdpa_cleanup.
> >>>>>>>
> >>>>>>> My original proposal was just to move the memory listener registration
> >>>>>>> to the last vhost_vdpa_init, and remove the unregister from
> >>>>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
> >>>>>>> be the same, the device should not realize this change.
> >>>>>> This can address LM downtime latency for sure, but it won't help
> >>>>>> downtime during dynamic SVQ switch - which still needs to go through 
> >>>>>> the
> >>>>>> full unmap/map cycle (that includes the slow part for pinning) from
> >>>>>> passthrough to SVQ mode. Be noted not every device could work with a
> >>>>>> separate ASID for SVQ descriptors. The fix should expect to work on
> >>>>>> normal vDPA vendor devices without a separate descriptor ASID, with
> >>>>>> platform IOMMU underneath or with on-chip IOMMU.
> >>>>>>
> >>>>> At this moment the SVQ switch is very inefficient mapping-wise, as it
> >>>>> unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
> >>>>> allocated in low regions of the iova space, and then the guest memory
> >>>>> is allocated in this new IOVA region incrementally.
> >>>> Yep. The key to build this fast path for SVQ switching I think is to
> >>>> maintain the identity mapping for the passthrough queues so that QEMU
> >>>> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
> >>>> to GPA) while incrementally adding new mappings for SVQ vrings.
> >>>>
> >>>>> We can optimize that if we place SVQ in a free GPA area instead.
> >>>> Here's a question though: it might not be hard to find a free GPA range
> >>>> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
> >>>> ranges), but I'm not sure if easy to find a free GIOVA range for the
> >>>> vIOMMU case - particularly this has to work in the same entire 64bit
> >>>> IOVA address ranges that (for now) QEMU won't be able to "reserve" a
> >>>> specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be
> >>>> done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and
> >>>> virito-iommu) so that we can call it out as a generic means for SVQ
> >>>> switching optimization?
> >>>>
> >>> In the case vIOMMU allocates a new block we will use the same algorithm 
> >>> as now:
> >>> * Find a new free IOVA chunk of the same size
> >>> * Map this new SVQ IOVA, that may or may not be the same as SVQ
> >>>
> >>> Since we must go through the translation phase to sanitize guest's
> >>> available descriptors anyway, it has zero added cost.
> >> Not sure I followed, this can work but doesn't seem able to reuse the
> >> old host kernel mappings for guest memory, hence still requires remap of
> >> the entire host IOVA ranges when SVQ IOVA comes along. I think by
> >> maintaining 1:1 identity map on guest memory, we don't have to bother
> >> tearing down existing HVA->HPA mappings in kernel thus save the
> >> expensive pinning calls at large. I don't clearly see under this scheme,
> >> how the new SVQ IOVA may work with potential conflict on IOVA space from
> >> hotplugged memory - in this case the 1:1 IOVA->GPA identity guest memory
> >> mapping can't be kept.
> >>
> > There is no need to maintain the 1:1 for memory mapped after the
> > pinning. The bigger reason to maintain them is to reduce the downtime
> > because of pinning.
> Yes, if guest users don't care about SVQ switching downtime there's no
> need to maintain 1:1, and SVQ translation like today should work for
> them. However most live migration users who care about downtime during
> live migration also care about the downtime for SVQ switching - you
> don't want to say that brownout time like 300ms or so in the mid of live
> migration at the cost of complete service loss of 4 to 5 seconds at the
> start of migration is a win to them. What I said above has the
> presumption that we both are looking at (the possibility of) a
> generic/software way to optimize/reduce pinning overhead on the downtime
> - say what can be done at QEMU level or host kernel to avoid pinning (at
> SVQ switch) rather than put burden on every hardware vendor to implement
> a separate ASID for SVQ.
>


Your assumption is right, let's talk in the example as I think it
would be easier to continue this.

> > After that, we can reuse the method we're using at
> > this moment, looking for a new free hole for the new map. ew only need
> > to pin the new map.
> Consider this sequence:
> - initially host device uses original GPA for guest memory mapping
> - live migration starts off, QEMU's iova tree maps guest memory GPA 1:1
> to IOVA

GPA is already mapped & pinned, there is no need to map it again. all
the remapping is done at the moment, and what I'm proposing is to not
to do so.

> - SVQ new maps allocated from a free hole on iova tree that happen to
> fall just above the IOVA range for guest memory GPA
> - new memory hot plugged to guest while migration is still going on in
> source host. although this hot plugged memory region sits right above
> the guest memory blocks in guest memory space (GPA), it will get new map
> in a separate range (not 1:1 mapped to GPA any more!) from the QEMU iova
> tree. Since QEMU mediates and translates virtqueue access to memory via
> SVQ (all guest memory maps to the same IOVA for SVQ), so far so good
> - for some reason live migration fails before VM is able to be migrated
> to the destination host, VM has to resume from the source host immediately
> - since live migration had failed, QEMU will unmap SVQ vrings from the
> IOVA tree and then stop shadowing virtqueue access
> - the host vDPA device now has incorrect passthrough GPA mapping for the
> part of newly plugged guest memory, as that belongs to the IOVA range
> where SVQ translation is in use
>

I'm assuming we can also support <300ms downtime in that case of
memory hotplug, and that the pin / unpin of SVQ regions is bearable.

As a first step, we can also move the SVQ vrings IOVA in the event of
hot-plugging memory. but that would require a restart of the device
unless RING_RESET is supported. Would that improve the situation?

Another solution is to effectively be smarter in the case of a DMA
remap, and not unpin memory that is going to be pin anyway.

> Although it might not be easy for SVQ vrings and plugged memory to
> collide on the same GPA/IOVA range, this is something we should prevent
> from happening in the first place I'd assume. You can say we only look
> from higher IOVA ranges to map SVQ so that the lower range can be
> reserved for latent hot plug memory, but this still needs complicated
> implementation to deal with IOVA range fragmentation and special
> collision prevention from breaking the 1:1 map to guest memory space.
>
> Regards,
> -Siwei
>
> >
> >>> Another option would be to move the SVQ vring to a new region, but I
> >>> don't see any advantage on maintaining 1:1 mapping at that point.
> >> See above. For pinning avoidance point of view (i.e. QEMU software
> >> optimization on SVQ switching downtime), I think it's crucial to
> >> maintain 1:1 mapping and move SVQ vring to a specific region. But I'm
> >> not sure how complex it will be for QEMU to get it implemented in a
> >> clean way.
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>>> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
> >>>> build a fast path in the kernel via a new vhost IOTLB command, say
> >>>> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
> >>>> (passthrough) mappings and update to use the SVQ ones in a single batch,
> >>>> while keeping the pages for guest memory always pinned (the kernel will
> >>>> make this decision). This doesn't expose pinning to userspace, and can
> >>>> also fix downtime issue.
> >>>>
> >>>>>     All
> >>>>> of the "translations" still need to be done, to ensure the guest
> >>>>> doesn't have access to SVQ vring. That way, qemu will not send all the
> >>>>> unmaps & maps, only the new ones. And vhost/vdpa does not need to call
> >>>>> unpin_user_page / pin_user_pages for all the guest memory.
> >>>>>
> >>>>> More optimizations include the batching of the SVQ vrings.
> >>>> Nods.
> >>>>
> >>>>>>> One of the concerns was that it could delay VM initialization, and I
> >>>>>>> didn't profile it but I think that may be the case.
> >>>>>> Yes, that's the concern here - we should not introduce regression to
> >>>>>> normal VM boot process/time. In case of large VM it's very easy to see
> >>>>>> the side effect if we go this way.
> >>>>>>
> >>>>>>>      I'm not sure about
> >>>>>>> the right fix but I think the change is easy to profile. If that is
> >>>>>>> the case, we could:
> >>>>>>> * use a flag (listener->address_space ?) and only register the
> >>>>>>> listener in _init if waiting for a migration, do it in _start
> >>>>>>> otherwise.
> >>>>>> Just doing this alone won't help SVQ mode switch downtime, see the
> >>>>>> reason stated above.
> >>>>>>
> >>>>>>> * something like io_uring, where the map can be done in parallel and
> >>>>>>> we can fail _start if some of them fails.
> >>>>>> This can alleviate the problem somehow, but still sub-optimal and not
> >>>>>> scalable with larger size. I'd like zero or least pinning to be done at
> >>>>>> the SVQ switch or migration time.
> >>>>>>
> >>>>> To reduce even further the pinning at SVQ time we would need to
> >>>>> preallocate SVQ vrings before suspending the device.
> >>>> Yep. Preallocate SVQ vrings at the start of migration, but before
> >>>> suspending the device. This will work under the assumption that we can
> >>>> reserve or "steal" some ranges from the GPA or GIOVA space...
> >>>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>>>>> Actually I think the counterpart vhost_detach_mm() only gets
> >>>>>>>> handled in vhost_vdpa_release() at device close time is a resulting
> >>>>>>>> artifact and amiss of today's vhost protocol - the opposite 
> >>>>>>>> RESET_OWNER
> >>>>>>>> call is not made mandatory hence only seen implemented in vhost-net
> >>>>>>>> device today. One qemu instance could well exec(3) another new qemu
> >>>>>>>> instance to live upgrade itself while keeping all emulated devices 
> >>>>>>>> and
> >>>>>>>> guest alive. The current vhost design simply prohibits this from 
> >>>>>>>> happening.
> >>>>>>>>
> >>>>>>> Ok, I was not aware of this. Thanks for explaining it!
> >>>>>>>
> >>>>>>>>>>>        so all the guest memory can be pinned for all the guest / 
> >>>>>>>>>>> qemu
> >>>>>>>>>>> lifecycle.
> >>>>>>>>>> I think to tie pinning to guest / qemu process life cycle makes 
> >>>>>>>>>> more
> >>>>>>>>>> sense. Essentially this pinning part needs to be decoupled from the
> >>>>>>>>>> iotlb mapping abstraction layer, and can / should work as a 
> >>>>>>>>>> standalone
> >>>>>>>>>> uAPI. Such that QEMU at the destination may launch and pin all 
> >>>>>>>>>> guest's
> >>>>>>>>>> memory as needed without having to start the device, while 
> >>>>>>>>>> awaiting any
> >>>>>>>>>> incoming migration request. Though problem is, there's no existing 
> >>>>>>>>>> vhost
> >>>>>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> >>>>>>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection 
> >>>>>>>>>> against
> >>>>>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, 
> >>>>>>>>>> subject
> >>>>>>>>>> to guest's life cycle?
> >>>>>>>>>>
> >>>>>>>>> I think that to pin or not pin memory maps should be a kernel
> >>>>>>>>> decision, not to be driven by qemu.
> >>>>>>>> It's kernel decision for sure. I am with this part.
> >>>>>>>>
> >>>>>>>>> I'm not against it if needed, but
> >>>>>>>>> let me know if the current "clean at close" address your concerns.
> >>>>>>>> To better facilitate QEMU exec (live update) case, I propose we add 
> >>>>>>>> new
> >>>>>>>> vhost uAPI pair for explicit pinning request - which would live with
> >>>>>>>> user mm's, or more precisely qemu instance's lifecycle.
> >>>>>>>>
> >>>>>>> Ok I see your problem better now, but I think it should be solved at
> >>>>>>> kernel level. Does that live update need to forcefully unpin and pin
> >>>>>>> the memory again,
> >>>>>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
> >>>>>> downtime expectation. The exec(3)'d process should inherit the page
> >>>>>> pinning and/or mlock accounting from the original QEMU process, while
> >>>>>> keeping original page pinning intact. Physical page mappings for DMA 
> >>>>>> can
> >>>>>> be kept as is to avoid the need of reprogramming device, though in this
> >>>>>> case the existing vhost iotlb entries should be updated to reflect the
> >>>>>> new HVA in the exec(3)'d QEMU process.
> >>>>>>
> >>>>>>>      or that is just a consequence of how it works the
> >>>>>>> memory listener right now?
> >>>>>>>
> >>>>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
> >>>>>>> most natural way to me.
> >>>>>> Not sure, I think RESET_OWNER might be too heavy weighted to implement
> >>>>>> live update, and people are not clear what the exact semantics are by
> >>>>>> using it (which part of the device state is being reset, and how
> >>>>>> much)... In addition, people working on iommufd intended to make this a
> >>>>>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
> >>>>>>
> >>>>>> https://lore.kernel.org/kvm/y5ibvv9pnmifi...@ziepe.ca/
> >>>>>>
> >>>>>> New uAPI to just change ownership of mm seems a better fit to me...
> >>>>>>
> >>>>> I'm not sure about the right solution here, but there are other
> >>>>> proposals to batch ioctls. But maybe creating a new one fits better.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> Another concern is the use_va stuff, originally it tags to the 
> >>>>>>>>>> device
> >>>>>>>>>> level and is made static at the time of device instantiation, 
> >>>>>>>>>> which is
> >>>>>>>>>> fine. But others to come just find a new home at per-group level or
> >>>>>>>>>> per-vq level struct. Hard to tell whether or not pinning is 
> >>>>>>>>>> actually
> >>>>>>>>>> needed for the latter use_va friends, as they are essentially tied 
> >>>>>>>>>> to
> >>>>>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu 
> >>>>>>>>>> starts
> >>>>>>>>>> way earlier than that. Perhaps just ignore those sub-device level 
> >>>>>>>>>> use_va
> >>>>>>>>>> usages? Presumably !use_va at the device level is sufficient to 
> >>>>>>>>>> infer
> >>>>>>>>>> the need of pinning for device?
> >>>>>>>>>>
> >>>>>>>>> I don't follow this. But I have the feeling that the subject of my
> >>>>>>>>> original mail is way more accurate if I would have said just "memory
> >>>>>>>>> maps".
> >>>>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction 
> >>>>>>>> for
> >>>>>>>> mapping, not pinning. Although in some case mapping implicitly 
> >>>>>>>> relies on
> >>>>>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
> >>>>>>>> semantics. We can do explicit on-demand pinning for cases for e.g.
> >>>>>>>> warming up device at live migration destination.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> I still consider the way to fix it is to actually delegate that to 
> >>>>>>>>> the
> >>>>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
> >>>>>>>>> not. But let me know if I missed something.
> >>>>>>>> You can disregard this for now. I will discuss that further with you
> >>>>>>>> guys while bind_mm and per-group use_va stuffs are landed.
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>> -Siwei
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> -Siwei
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> This has two main problems:
> >>>>>>>>>>> * At this moment the reset semantics forces the vdpa device to 
> >>>>>>>>>>> unmap
> >>>>>>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
> >>>>>>>>>>> * This may increase the initialization time. Maybe we can delay 
> >>>>>>>>>>> it if
> >>>>>>>>>>> qemu is not the destination of a LM. Anyway I think this should be
> >>>>>>>>>>> done as an optimization on top.
> >>>>>>>>>>>
> >>>>>>>>>>> Any ideas or comments in this regard?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>
>

Re: Reducing vdpa migration downtime because of memory pin / maps

Reply via email to