On Fri, Nov 3, 2023 at 9:19 PM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
> On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
> > On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>
> >> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> >>> Current memory operations like pinning may take a lot of time at the
> >>> destination. Currently they are done after the source of the migration
> >>> is stopped, and before the workload is resumed at the destination. This
> >>> is a period where neither traffic can flow, nor the VM workload can
> >>> continue (downtime).
> >>>
> >>> We can do better, as we know the memory layout of the guest RAM at the
> >>> destination from the moment the migration starts. Moving that operation
> >>> earlier allows QEMU to communicate the maps to the kernel while the
> >>> workload is still running in the source, so Linux can start mapping
> >>> them. Ideally, all of the IOMMU is configured, but even if the vDPA
> >>> parent driver uses an on-chip IOMMU and .set_map we're still saving all
> >>> the pinning time.
> >> I get what you want to say, though I'm not sure how pinning is relevant
> >> to on-chip IOMMU and .set_map here; essentially pinning is required for
> >> all parent vdpa drivers that perform DMA and hence don't want VM pages
> >> to move around.
> > Basically highlighting that the work done under .set_map is not only
> > pinning, but pinning is a significant fraction of it. It can be reworded
> > or deleted for sure.
> >
> >>> Note that further device setup at the end of the migration may alter
> >>> the guest memory layout. But, same as the previous point, many
> >>> operations are still done incrementally, like memory pinning, so we're
> >>> saving time anyway.
> >>>
> >>> The first bunch of patches just reorganizes the code so that memory
> >>> related operation parameters are shared between all vhost_vdpa devices.
> >>> This is because the destination does not know which vhost_vdpa struct
> >>> will have the registered listener member, so it is easier to place them
> >>> in a shared struct rather than to keep them in the vhost_vdpa struct.
> >>> Future versions may squash or omit these patches.
> >> It looks like this VhostVDPAShared facility (patches 1-13) is also what
> >> I need in my SVQ descriptor group series [*], for which I've built a
> >> similar construct there. If possible please try to merge this in ASAP.
> >> I'll rework my series on top of that.
> >>
> >> [*]
> >> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >>
> > I can send it individually, for sure.
> >
> > MST, Jason, can this first part be merged? It doesn't add a lot by
> > itself, but it helps pave the way for future changes.
> If it cannot, it doesn't matter. I can pick it up from here and get my
> series posted with your patches 1-13 applied upfront. This should work,
> I think?
>
> >>> Only tested with vdpa_sim. I'm sending this before a full benchmark, as
> >>> some work like [1] can be based on it, and Si-Wei agreed to benchmark
> >>> this series with his experience.
> >> Haven't done the full benchmark compared to pre-map at the destination
> >> yet, though an observation is that the destination QEMU can get stuck
> >> very easily for a very long time while in the middle of pinning pages.
> >> During this period, any client doing a read-only QMP query or executing
> >> an HMP info command gets frozen indefinitely (subject to how large the
> >> memory being pinned is). Is it possible to unblock those QMP requests or
> >> HMP commands (at least the read-only ones) while in migration? Yield
> >> from the load_setup coroutine and spawn another thread?
> >>
> > Ok, I wasn't aware of that.
> >
> > I think we cannot yield in a coroutine and wait for an ioctl.
> I was wondering if we need a separate coroutine out of the general
> migration path to support this special code without overloading
> load_setup or its callers. For instance, unblocking the source from
> sending guest RAM while allowing the destination to pin pages in parallel
> should be possible.
>
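(As an aside, the "spawn another thread" idea could look roughly like the
minimal sketch below. Everything here is illustrative rather than the
actual series: the entry point and the shared state would come from the
vhost_vdpa_load_setup patches, and the mapping loop itself is elided.)

#include "qemu/osdep.h"
#include "qemu/thread.h"

/* Stand-in for the real VhostVDPAShared from patch 1; only the field the
 * sketch needs is shown. */
typedef struct {
    QemuThread map_thread;
} VdpaEarlyMapState;

/* Hypothetical worker: walk the guest memory layout and issue the map
 * requests (IOTLB updates or the parent's .set_map path) while the source
 * is still sending RAM. */
static void *vdpa_early_map_worker(void *opaque)
{
    VdpaEarlyMapState *s = opaque;

    (void)s; /* the actual mapping/pinning loop would go here */
    return NULL;
}

/* Called from the destination's load_setup path: start the heavy lifting
 * without blocking the migration stream or the monitor. */
static int vdpa_early_map_start(VdpaEarlyMapState *s)
{
    qemu_thread_create(&s->map_thread, "vdpa-map", vdpa_early_map_worker,
                       s, QEMU_THREAD_JOINABLE);
    return 0;
}

/* Called before the device is started on the destination: this join is
 * the "barrier" where the migration either continues or is cancelled. */
static void vdpa_early_map_wait(VdpaEarlyMapState *s)
{
    qemu_thread_join(&s->map_thread);
}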
Hi Si-Wei,

I'm working on this; I think I'll be able to send a new version soon. Just
a question: when the mapping is done in vhost_vdpa_dev_start, as the
current upstream master does, are you able to interact with QMP?

Thanks!

> Regardless, a separate thread is needed to carry out all the heavy
> lifting, i.e. the ioctl(2) or write(2) syscalls to map & pin pages.
>
> > One option that came to my mind is to effectively use another thread,
> > and use a POSIX barrier (or equivalent on glib / QEMU) before finishing
> > the migration.
> Yes, a separate thread is needed anyway.
>
> > I'm not sure if there are more points where we can check the barrier
> > and tell the migration to continue or stop though.
> I think there are, e.g. what if the dma_map fails? There must be a
> checkpoint for that.
>
> > Another option is to effectively start doing these ioctls in an
> > asynchronous way, io_uring cmd like, but I'd like to achieve this
> > first.
> Yes, io_uring or any async API could be another option. Though this needs
> a new uAPI through an additional kernel facility to support it. Anyway,
> it's up to you to decide. :)
>
> Regards,
> -Siwei
>
> >> Having said that, I'm not sure if .load_setup is a good fit for what we
> >> want to do. Searching all current users of .load_setup, either the job
> >> can be done instantly or the task is time bound without trapping into
> >> the kernel for too long. Maybe pinning is too special a use case here...
> >>
> >> -Siwei
> >>>
> >>> Future directions on top of this series may include:
> >>> * Iterative migration of virtio-net devices, as it may reduce downtime
> >>>   per [1]. vhost-vdpa net can apply the configuration through CVQ in
> >>>   the destination while the source is still migrating.
> >>> * Move more things ahead of migration time, like DRIVER_OK.
> >>> * Check that the devices of the destination are valid, and cancel the
> >>>   migration in case they are not.
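(For context on the .load_setup question above: the hook is one of the
SaveVMHandlers callbacks, so an early-map user would register roughly as in
the sketch below. The handler names and idstr are made up for illustration
and are not what the actual patches do; load_cleanup is shown only as one
possible checkpoint for a failed dma_map.)

#include "qemu/osdep.h"
#include "migration/register.h"

static int early_map_load_setup(QEMUFile *f, void *opaque)
{
    /* Runs on the destination as soon as the incoming stream starts, i.e.
     * while the source is still running: a place to kick off mapping. */
    return 0;
}

static int early_map_load_cleanup(void *opaque)
{
    /* One possible checkpoint: report a failed dma_map so the migration
     * can be cancelled instead of starting a broken device. */
    return 0;
}

static const SaveVMHandlers early_map_handlers = {
    .load_setup   = early_map_load_setup,
    .load_cleanup = early_map_load_cleanup,
};

static void early_map_register(void *opaque)
{
    register_savevm_live("early-map-example", 0, 1, &early_map_handlers,
                         opaque);
}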
> >>>
> >>> [1]
> >>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/
> >>>
> >>> Eugenio Pérez (18):
> >>>   vdpa: add VhostVDPAShared
> >>>   vdpa: move iova tree to the shared struct
> >>>   vdpa: move iova_range to vhost_vdpa_shared
> >>>   vdpa: move shadow_data to vhost_vdpa_shared
> >>>   vdpa: use vdpa shared for tracing
> >>>   vdpa: move file descriptor to vhost_vdpa_shared
> >>>   vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >>>   vdpa: move backend_cap to vhost_vdpa_shared
> >>>   vdpa: remove msg type of vhost_vdpa
> >>>   vdpa: move iommu_list to vhost_vdpa_shared
> >>>   vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >>>   vdpa: use dev_shared in vdpa_iommu
> >>>   vdpa: move memory listener to vhost_vdpa_shared
> >>>   vdpa: do not set virtio status bits if unneeded
> >>>   vdpa: add vhost_vdpa_load_setup
> >>>   vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >>>   vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >>>   virtio_net: register incremental migration handlers
> >>>
> >>>  include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >>>  include/net/net.h              |   4 +
> >>>  hw/net/virtio-net.c            |  23 +++++
> >>>  hw/virtio/vdpa-dev.c           |   7 +-
> >>>  hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >>>  net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >>>  hw/virtio/trace-events         |  14 +--
> >>>  7 files changed, 239 insertions(+), 162 deletions(-)
>
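(For reference, the per-region map request whose latency this series moves
out of the downtime window is an IOTLB update written to the vhost-vdpa
device fd; the kernel forwards it to the parent driver's dma_map or
.set_map, which is where the pinning happens. Below is a standalone sketch
of that uAPI usage, with field names as in linux/vhost_types.h and error
handling omitted.)

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/vhost.h>        /* VHOST_IOTLB_MSG_V2 */
#include <linux/vhost_types.h>  /* struct vhost_msg_v2, VHOST_IOTLB_UPDATE */

/* Map one IOVA range to a process virtual address on a vhost-vdpa fd.
 * QEMU's equivalent is vhost_vdpa_dma_map() in hw/virtio/vhost-vdpa.c. */
static int vdpa_map_one(int device_fd, uint64_t iova, uint64_t size,
                        void *vaddr, int readonly)
{
    struct vhost_msg_v2 msg;

    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG_V2;
    msg.iotlb.iova = iova;
    msg.iotlb.size = size;
    msg.iotlb.uaddr = (uint64_t)(uintptr_t)vaddr;
    msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
    msg.iotlb.type = VHOST_IOTLB_UPDATE;

    /* The write typically returns only after the parent driver has mapped
     * (and usually pinned) the range, which is why issuing these during
     * the stop window adds directly to downtime. */
    return write(device_fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}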