On Mon, Jul 17, 2023 at 9:57 PM Si-Wei Liu <si-wei....@oracle.com> wrote: > > Hey, > > I am now back from the break. Sorry for the delayed response, please see > in line. > > On 7/9/2023 11:04 PM, Eugenio Perez Martin wrote: > > On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu <si-wei....@oracle.com> wrote: > >> > >> > >> On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote: > >>> On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei....@oracle.com> wrote: > >>>> > >>>> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote: > >>>>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei....@oracle.com> > >>>>> wrote: > >>>>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote: > >>>>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei....@oracle.com> > >>>>>>> wrote: > >>>>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote: > >>>>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei....@oracle.com> > >>>>>>>>> wrote: > >>>>>>>>>> Sorry for reviving this old thread, I lost the best timing to > >>>>>>>>>> follow up > >>>>>>>>>> on this while I was on vacation. I have been working on this and > >>>>>>>>>> found > >>>>>>>>>> out some discrepancy, please see below. > >>>>>>>>>> > >>>>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote: > >>>>>>>>>>> Hi! > >>>>>>>>>>> > >>>>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one > >>>>>>>>>>> of > >>>>>>>>>>> the factors that adds more downtime to migration is the handling > >>>>>>>>>>> of > >>>>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is > >>>>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, > >>>>>>>>>>> the > >>>>>>>>>>> destination device waits until all the guest memory / state is > >>>>>>>>>>> migrated to start pinning all the memory. > >>>>>>>>>>> > >>>>>>>>>>> The proposal is to bind it to the char device life cycle (open vs > >>>>>>>>>>> close), > >>>>>>>>>> Hmmm, really? If it's the life cycle for char device, the next > >>>>>>>>>> guest / > >>>>>>>>>> qemu launch on the same vhost-vdpa device node won't make it work. > >>>>>>>>>> > >>>>>>>>> Maybe my sentence was not accurate, but I think we're on the same > >>>>>>>>> page here. > >>>>>>>>> > >>>>>>>>> Two qemu instances opening the same char device at the same time are > >>>>>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next > >>>>>>>>> qemu that opens the char device should see a clean device anyway. > >>>>>>>> I mean the pin can't be done at the time of char device open, where > >>>>>>>> the > >>>>>>>> user address space is not known/bound yet. The earliest point > >>>>>>>> possible > >>>>>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER > >>>>>>>> is > >>>>>>>> done. > >>>>>>> Maybe we are deviating, let me start again. > >>>>>>> > >>>>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the > >>>>>>> .listener member of struct vhost_vdpa. > >>>>>>> > >>>>>>> At this moment, the memory listener is registered at > >>>>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev, > >>>>>>> and is unregistered in both vhost_vdpa_reset_status and > >>>>>>> vhost_vdpa_cleanup. > >>>>>>> > >>>>>>> My original proposal was just to move the memory listener registration > >>>>>>> to the last vhost_vdpa_init, and remove the unregister from > >>>>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would > >>>>>>> be the same, the device should not realize this change. 
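(To make the registration-lifecycle idea above concrete: this is roughly the kind of change I have in mind in QEMU's hw/virtio/vhost-vdpa.c. Untested sketch only; function names are from a current tree, and the error handling plus the "only for the last vhost_dev of the device" check are elided.)

static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
{
    struct vhost_vdpa *v = opaque;

    /* ... existing init ... */

    /*
     * Proposed: register the listener here, at init time, instead of in
     * vhost_vdpa_dev_start(dev, started = true).  Guest memory is then
     * mapped -- and pinned by the vdpa parent driver -- as soon as QEMU
     * knows about it, not at DRIVER_OK time.
     */
    memory_listener_register(&v->listener, &address_space_memory);

    return 0;
}

static void vhost_vdpa_reset_status(struct vhost_dev *dev)
{
    /* ... status handling as today ... */

    /*
     * Proposed: no memory_listener_unregister() here anymore, so the maps
     * (and the pinning behind them) survive a virtio reset.  The
     * unregister stays only in vhost_vdpa_cleanup().
     */
}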
> >>>>>> This can address LM downtime latency for sure, but it won't help > the downtime during a dynamic SVQ switch - which still needs to go through > the > full unmap/map cycle (which includes the slow pinning part) from > passthrough to SVQ mode. Note that not every device can work with a > separate ASID for SVQ descriptors. The fix should be expected to work on > normal vDPA vendor devices without a separate descriptor ASID, with a > platform IOMMU underneath or with an on-chip IOMMU. > > >>>>> At this moment the SVQ switch is very inefficient mapping-wise, as it > unmaps all the GPA->HVA maps and overrides them. In particular, SVQ is > allocated in low regions of the iova space, and then the guest memory > is allocated in this new IOVA region incrementally. > >>>> Yep. The key to building this fast path for SVQ switching, I think, is to > maintain the identity mapping for the passthrough queues so that QEMU > can reuse the old mappings for guest memory (e.g. GIOVA identity mapped > to GPA) while incrementally adding new mappings for the SVQ vrings. > > >>>>> We can optimize that if we place SVQ in a free GPA area instead. > >>>> Here's a question though: it might not be hard to find a free GPA range > for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit > ranges), but I'm not sure it is easy to find a free GIOVA range for the > vIOMMU case - particularly as this has to work within the same entire 64bit > IOVA address range, where (for now) QEMU won't be able to "reserve" a > specific IOVA range for SVQ from the vIOMMU. Do you foresee this being > done for every QEMU emulated vIOMMU (intel-iommu, amd-iommu, arm smmu and > virtio-iommu) so that we can call it out as a generic means for SVQ > switching optimization? > > >>> In the case the vIOMMU allocates a new block we will use the same algorithm > as now: > >>> * Find a new free IOVA chunk of the same size > >>> * Map the SVQ at this new IOVA, which may or may not be the same as the previous one > >>> > >>> Since we must go through the translation phase to sanitize the guest's > available descriptors anyway, it has zero added cost. > >> Not sure I followed; this can work, but it doesn't seem able to reuse the > old host kernel mappings for guest memory, hence it still requires remapping > the entire host IOVA range when the SVQ IOVA comes along. I think by > maintaining a 1:1 identity map on guest memory, we don't have to bother > tearing down the existing HVA->HPA mappings in the kernel, thus saving the > expensive pinning calls at large. I don't clearly see, under this scheme, > how the new SVQ IOVA would handle a potential conflict in the IOVA space with > hotplugged memory - in that case the 1:1 IOVA->GPA identity guest memory > mapping can't be kept. > > > There is no need to maintain the 1:1 for memory mapped after the > > pinning. The bigger reason to maintain them is to reduce the downtime > > because of pinning. > Yes, if guest users don't care about SVQ switching downtime there's no > need to maintain 1:1, and SVQ translation like today should work for > them. However, most live migration users who care about downtime during > live migration also care about the downtime for SVQ switching - you > don't want to tell them that a brownout of 300ms or so in the middle of live > migration, at the cost of a complete service loss of 4 to 5 seconds at the > start of migration, is a win. 
What I said above has the > presumption that we both are looking at (the possibility of) a > generic/software way to optimize/reduce the pinning overhead on the downtime > - say, what can be done at the QEMU level or in the host kernel to avoid pinning (at > SVQ switch time) rather than putting the burden on every hardware vendor to implement > a separate ASID for SVQ. >
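For what it's worth, here is a toy model of the layout I think we are both describing: guest memory stays identity-mapped (IOVA == GPA) and the SVQ vrings are carved out of a window above the highest GPA, so switching to SVQ only adds the vring maps instead of remapping (and re-pinning) all of the guest memory. Plain C, no QEMU APIs, purely illustrative:

#include <stdint.h>
#include <stdio.h>

struct map { uint64_t iova, addr, size; };   /* addr: GPA or HVA */

/* Guest memory regions stay identity mapped: IOVA == GPA. */
static struct map identity_map(uint64_t gpa, uint64_t size)
{
    return (struct map){ .iova = gpa, .addr = gpa, .size = size };
}

/*
 * SVQ vrings come from a window starting right above the highest guest
 * GPA (or above a fixed boundary such as 1ULL << 48, if the device and
 * the vIOMMU address width allow it), so they can never collide with
 * the identity-mapped guest memory.
 */
static uint64_t svq_next;

static struct map svq_map(uint64_t hva, uint64_t size)
{
    struct map m = { .iova = svq_next, .addr = hva, .size = size };
    svq_next += size;
    return m;
}

int main(void)
{
    /* Two guest memory regions, below and above the 4G hole. */
    struct map guest[] = {
        identity_map(0x000000000ULL, 0x080000000ULL),   /* 0  .. 2G */
        identity_map(0x100000000ULL, 0x100000000ULL),   /* 4G .. 8G */
    };
    svq_next = guest[1].iova + guest[1].size;           /* SVQ window */

    /* Switching to SVQ only adds this map; guest maps stay untouched. */
    struct map vring = svq_map(0x7f0000000000ULL, 0x4000);

    for (int i = 0; i < 2; i++) {
        printf("guest region %d: iova 0x%llx..0x%llx (1:1 with GPA)\n", i,
               (unsigned long long)guest[i].iova,
               (unsigned long long)(guest[i].iova + guest[i].size));
    }
    printf("svq vring: iova 0x%llx size 0x%llx (above guest memory)\n",
           (unsigned long long)vring.iova, (unsigned long long)vring.size);
    return 0;
}

The whole point is that nothing in the identity-mapped set ever needs to be unmapped or re-pinned when vring entries are added or removed.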
Your assumption is right, let's talk through the example as I think it would be easier to continue this. > > After that, we can reuse the method we're using at > > this moment, looking for a new free hole for the new map. We only need > > to pin the new map. > Consider this sequence: > - initially the host device uses the original GPA for guest memory mapping > - live migration starts off, QEMU's iova tree maps guest memory GPA 1:1 > to IOVA. The GPA is already mapped & pinned at this point, there is no need to map it again. All the remapping is done at the moment, and what I'm proposing is not to do so. > - new SVQ maps are allocated from a free hole on the iova tree that happens to > fall just above the IOVA range for guest memory GPA > - new memory hot plugged to guest while migration is still going on in the > source host. although this hot plugged memory region sits right above > the guest memory blocks in guest memory space (GPA), it will get a new map > in a separate range (not 1:1 mapped to GPA any more!) from the QEMU iova > tree. Since QEMU mediates and translates virtqueue access to memory via > SVQ (all guest memory maps to the same IOVA for SVQ), so far so good > - for some reason live migration fails before the VM is able to be migrated > to the destination host, the VM has to resume from the source host immediately > - since live migration had failed, QEMU will unmap SVQ vrings from the > IOVA tree and then stop shadowing virtqueue access > - the host vDPA device now has an incorrect passthrough GPA mapping for the > part of newly plugged guest memory, as that belongs to the IOVA range > where SVQ translation is in use > I'm assuming we can also support <300ms downtime in the case of memory hotplug, and that the pin / unpin of SVQ regions is bearable. As a first step, we can also move the SVQ vrings IOVA in the event of hot-plugging memory, but that would require a restart of the device unless RING_RESET is supported. Would that improve the situation? Another solution is to effectively be smarter in the case of a DMA remap, and not unpin memory that is going to be pinned anyway (see the sketch below). > Although it might not be easy for SVQ vrings and plugged memory to > collide on the same GPA/IOVA range, this is something we should prevent > from happening in the first place, I'd assume. You can say we only look > from higher IOVA ranges to map SVQ so that the lower range can be > reserved for latent hot plug memory, but this still needs a complicated > implementation to deal with IOVA range fragmentation and special > collision prevention from breaking the 1:1 map to guest memory space. > > Regards, > -Siwei > > > > >>> Another option would be to move the SVQ vring to a new region, but I > >>> don't see any advantage in maintaining the 1:1 mapping at that point. > >> See above. From the pinning avoidance point of view (i.e. QEMU software > >> optimization of the SVQ switching downtime), I think it's crucial to > >> maintain the 1:1 mapping and move the SVQ vring to a specific region. But I'm > >> not sure how complex it will be for QEMU to get it implemented in a > >> clean way. > >> > >> Thanks, > >> -Siwei > >> > >>>> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather > >>>> build a fast path in the kernel via a new vhost IOTLB command, say > >>>> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing > >>>> (passthrough) mappings and update to use the SVQ ones in a single batch, > >>>> while keeping the pages for guest memory always pinned (the kernel will > >>>> make this decision). 
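Here is the sketch I mean by "be smarter in the case of a DMA remap", and I think it is also roughly what a kernel-side INVALIDATE_AND_UPDATE_ALL would end up doing internally: diff the old and new HVA sets and only touch the delta, since on a passthrough <-> SVQ switch the guest memory HVAs are identical and only the IOVAs plus the vring entries change. Plain C, purely illustrative; it assumes the ranges inside each set are disjoint:

#include <stdint.h>
#include <stdio.h>

struct range { uint64_t start, end; };      /* [start, end) in HVA space */

static uint64_t overlap(struct range a, struct range b)
{
    uint64_t s = a.start > b.start ? a.start : b.start;
    uint64_t e = a.end < b.end ? a.end : b.end;
    return e > s ? e - s : 0;
}

/*
 * On a remap (e.g. passthrough -> SVQ), only the parts of the old pinned
 * set that are not reused by the new set would need unpin_user_pages(),
 * and only the parts of the new set that are not already pinned would
 * need pin_user_pages().  The IOVA side can be rewritten without
 * touching the pins at all.
 */
static void remap_stats(const struct range *old_set, int n_old,
                        const struct range *new_set, int n_new)
{
    uint64_t reused = 0, old_total = 0, new_total = 0;

    for (int i = 0; i < n_old; i++) {
        old_total += old_set[i].end - old_set[i].start;
        for (int j = 0; j < n_new; j++)
            reused += overlap(old_set[i], new_set[j]);
    }
    for (int j = 0; j < n_new; j++)
        new_total += new_set[j].end - new_set[j].start;

    printf("unpin only:  0x%llx bytes\n",
           (unsigned long long)(old_total - reused));
    printf("pin only:    0x%llx bytes\n",
           (unsigned long long)(new_total - reused));
    printf("kept pinned: 0x%llx bytes\n", (unsigned long long)reused);
}

int main(void)
{
    /* Old set: all guest memory.  New set: guest memory + SVQ vrings. */
    struct range old_set[] = { { 0x7f0000000000ULL, 0x7f0200000000ULL } };
    struct range new_set[] = { { 0x7f0000000000ULL, 0x7f0200000000ULL },
                               { 0x7fff00000000ULL, 0x7fff00004000ULL } };

    remap_stats(old_set, 1, new_set, 2);
    return 0;
}

For this example only the 16KiB of vrings would get pinned at the switch, and going back from SVQ to passthrough is the symmetric case: only the vring pages get unpinned.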
This doesn't expose pinning to userspace, and can > >>>> also fix the downtime issue. > >>>> > >>>>> All > >>>>> of the "translations" still need to be done, to ensure the guest > >>>>> doesn't have access to the SVQ vring. That way, qemu will not send all the > >>>>> unmaps & maps, only the new ones. And vhost/vdpa does not need to call > >>>>> unpin_user_page / pin_user_pages for all the guest memory. > >>>>> > >>>>> More optimizations include the batching of the SVQ vrings. > >>>> Nods. > >>>> > >>>>>>> One of the concerns was that it could delay VM initialization, and I > >>>>>>> didn't profile it but I think that may be the case. > >>>>>> Yes, that's the concern here - we should not introduce a regression to > >>>>>> the normal VM boot process/time. In the case of a large VM it's very easy to see > >>>>>> the side effect if we go this way. > >>>>>> > >>>>>>> I'm not sure about > >>>>>>> the right fix but I think the change is easy to profile. If that is > >>>>>>> the case, we could: > >>>>>>> * use a flag (listener->address_space ?) and only register the > >>>>>>> listener in _init if waiting for a migration, do it in _start > >>>>>>> otherwise. > >>>>>> Just doing this alone won't help the SVQ mode switch downtime, see the > >>>>>> reason stated above. > >>>>>> > >>>>>>> * something like io_uring, where the map can be done in parallel and > >>>>>>> we can fail _start if some of them fail. > >>>>>> This can alleviate the problem somewhat, but it is still sub-optimal and not > >>>>>> scalable with larger sizes. I'd like zero or minimal pinning to be done at > >>>>>> the SVQ switch or migration time. > >>>>>> > >>>>> To reduce even further the pinning at SVQ time we would need to > >>>>> preallocate the SVQ vrings before suspending the device. > >>>> Yep. Preallocate SVQ vrings at the start of migration, but before > >>>> suspending the device. This will work under the assumption that we can > >>>> reserve or "steal" some ranges from the GPA or GIOVA space... > >>>> > >>>> Thanks, > >>>> -Siwei > >>>> > >>>>>>>> Actually I think the fact that the counterpart vhost_detach_mm() only gets > >>>>>>>> handled in vhost_vdpa_release() at device close time is a resulting > >>>>>>>> artifact and a miss of today's vhost protocol - the opposite > >>>>>>>> RESET_OWNER > >>>>>>>> call is not made mandatory, hence it is only seen implemented in the vhost-net > >>>>>>>> device today. One qemu instance could well exec(3) another new qemu > >>>>>>>> instance to live upgrade itself while keeping all emulated devices > >>>>>>>> and > >>>>>>>> the guest alive. The current vhost design simply prohibits this from > >>>>>>>> happening. > >>>>>>>> > >>>>>>> Ok, I was not aware of this. Thanks for explaining it! > >>>>>>> > >>>>>>>>>>> so all the guest memory can be pinned for all the guest / > >>>>>>>>>>> qemu > >>>>>>>>>>> lifecycle. > >>>>>>>>>> I think tying pinning to the guest / qemu process life cycle makes > >>>>>>>>>> more > >>>>>>>>>> sense. Essentially this pinning part needs to be decoupled from the > >>>>>>>>>> iotlb mapping abstraction layer, and can / should work as a > >>>>>>>>>> standalone > >>>>>>>>>> uAPI. Such that QEMU at the destination may launch and pin all the > >>>>>>>>>> guest's > >>>>>>>>>> memory as needed without having to start the device, while > >>>>>>>>>> awaiting any > >>>>>>>>>> incoming migration request. Though the problem is, there's no existing > >>>>>>>>>> vhost > >>>>>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER / > >>>>>>>>>> SET_MEM_TABLE / RESET_OWNER seem a remote fit.. 
Any objection > >>>>>>>>>> against > >>>>>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, > >>>>>>>>>> subject > >>>>>>>>>> to guest's life cycle? > >>>>>>>>>> > >>>>>>>>> I think that to pin or not pin memory maps should be a kernel > >>>>>>>>> decision, not to be driven by qemu. > >>>>>>>> It's kernel decision for sure. I am with this part. > >>>>>>>> > >>>>>>>>> I'm not against it if needed, but > >>>>>>>>> let me know if the current "clean at close" address your concerns. > >>>>>>>> To better facilitate QEMU exec (live update) case, I propose we add > >>>>>>>> new > >>>>>>>> vhost uAPI pair for explicit pinning request - which would live with > >>>>>>>> user mm's, or more precisely qemu instance's lifecycle. > >>>>>>>> > >>>>>>> Ok I see your problem better now, but I think it should be solved at > >>>>>>> kernel level. Does that live update need to forcefully unpin and pin > >>>>>>> the memory again, > >>>>>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the > >>>>>> downtime expectation. The exec(3)'d process should inherit the page > >>>>>> pinning and/or mlock accounting from the original QEMU process, while > >>>>>> keeping original page pinning intact. Physical page mappings for DMA > >>>>>> can > >>>>>> be kept as is to avoid the need of reprogramming device, though in this > >>>>>> case the existing vhost iotlb entries should be updated to reflect the > >>>>>> new HVA in the exec(3)'d QEMU process. > >>>>>> > >>>>>>> or that is just a consequence of how it works the > >>>>>>> memory listener right now? > >>>>>>> > >>>>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the > >>>>>>> most natural way to me. > >>>>>> Not sure, I think RESET_OWNER might be too heavy weighted to implement > >>>>>> live update, and people are not clear what the exact semantics are by > >>>>>> using it (which part of the device state is being reset, and how > >>>>>> much)... In addition, people working on iommufd intended to make this a > >>>>>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER: > >>>>>> > >>>>>> https://lore.kernel.org/kvm/y5ibvv9pnmifi...@ziepe.ca/ > >>>>>> > >>>>>> New uAPI to just change ownership of mm seems a better fit to me... > >>>>>> > >>>>> I'm not sure about the right solution here, but there are other > >>>>> proposals to batch ioctls. But maybe creating a new one fits better. > >>>>> > >>>>> Thanks! > >>>>> > >>>>>> Thanks, > >>>>>> -Siwei > >>>>>> > >>>>>>> Thanks! > >>>>>>> > >>>>>>> > >>>>>>>>>> Another concern is the use_va stuff, originally it tags to the > >>>>>>>>>> device > >>>>>>>>>> level and is made static at the time of device instantiation, > >>>>>>>>>> which is > >>>>>>>>>> fine. But others to come just find a new home at per-group level or > >>>>>>>>>> per-vq level struct. Hard to tell whether or not pinning is > >>>>>>>>>> actually > >>>>>>>>>> needed for the latter use_va friends, as they are essentially tied > >>>>>>>>>> to > >>>>>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu > >>>>>>>>>> starts > >>>>>>>>>> way earlier than that. Perhaps just ignore those sub-device level > >>>>>>>>>> use_va > >>>>>>>>>> usages? Presumably !use_va at the device level is sufficient to > >>>>>>>>>> infer > >>>>>>>>>> the need of pinning for device? > >>>>>>>>>> > >>>>>>>>> I don't follow this. But I have the feeling that the subject of my > >>>>>>>>> original mail is way more accurate if I would have said just "memory > >>>>>>>>> maps". 
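To have something concrete to poke at for the "new uAPI to just change ownership of the mm" idea above, a completely hypothetical sketch; nothing like this exists today, and both the name and the ioctl number are made up:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>        /* VHOST_VIRTIO, _IO() */

/*
 * Hypothetical: transfer ownership of an already-owned vhost fd to the
 * calling process' mm without dropping the existing page pins, mlock
 * accounting or IOTLB maps.  Meant for the live update case: the old
 * QEMU exec(3)s the new one with the fd kept open, and the new QEMU
 * claims ownership instead of doing RESET_OWNER + SET_OWNER, which
 * would force a full unpin / re-pin cycle.  The kernel keeps the
 * IOVA -> page associations; userspace follows up with IOTLB updates
 * only if the HVAs changed in the new mm.
 */
#define VHOST_CHANGE_OWNER  _IO(VHOST_VIRTIO, 0x7f)   /* number made up */

/* In the exec(3)'d QEMU, before touching the device (sketch): */
static int vhost_vdpa_claim_owner(int vdpa_fd)
{
    if (ioctl(vdpa_fd, VHOST_CHANGE_OWNER) < 0) {
        /* Old kernel: fall back to RESET_OWNER + SET_OWNER + full re-pin. */
        return -errno;
    }
    return 0;
}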
> >>>>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction > >>>>>>>> for > >>>>>>>> mapping, not pinning. Although in some case mapping implicitly > >>>>>>>> relies on > >>>>>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI > >>>>>>>> semantics. We can do explicit on-demand pinning for cases for e.g. > >>>>>>>> warming up device at live migration destination. > >>>>>>>> > >>>>>>>> > >>>>>>>>> I still consider the way to fix it is to actually delegate that to > >>>>>>>>> the > >>>>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or > >>>>>>>>> not. But let me know if I missed something. > >>>>>>>> You can disregard this for now. I will discuss that further with you > >>>>>>>> guys while bind_mm and per-group use_va stuffs are landed. > >>>>>>>> > >>>>>>>> Thanks! > >>>>>>>> -Siwei > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Thanks! > >>>>>>>>> > >>>>>>>>>> Regards, > >>>>>>>>>> -Siwei > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> This has two main problems: > >>>>>>>>>>> * At this moment the reset semantics forces the vdpa device to > >>>>>>>>>>> unmap > >>>>>>>>>>> all the memory. So this change needs a vhost vdpa feature flag. > >>>>>>>>>>> * This may increase the initialization time. Maybe we can delay > >>>>>>>>>>> it if > >>>>>>>>>>> qemu is not the destination of a LM. Anyway I think this should be > >>>>>>>>>>> done as an optimization on top. > >>>>>>>>>>> > >>>>>>>>>>> Any ideas or comments in this regard? > >>>>>>>>>>> > >>>>>>>>>>> Thanks! > >>>>>>>>>>> >
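(And on the feature flag mentioned in the quoted proposal above: I picture it as a plain backend feature bit that tells userspace the maps, and the pinning behind them, are preserved across virtio reset. Hypothetical sketch, both the name and the bit value are made up:)

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, NOT upstream: device keeps IOTLB maps across reset. */
#define VHOST_BACKEND_F_MAP_PERSIST  0x7   /* bit position, value made up */

/* QEMU side (sketch): only keep the maps across reset when negotiated. */
static bool vhost_vdpa_maps_persist(uint64_t backend_features)
{
    return backend_features & (0x1ULL << VHOST_BACKEND_F_MAP_PERSIST);
}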