On Mon, Aug 18, 2025 at 6:21 PM Peter Xu <pet...@redhat.com> wrote:
>
> On Mon, Aug 18, 2025 at 10:46:00AM -0400, Jonah Palmer wrote:
> >
> > On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> > > On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.pal...@oracle.com> wrote:
> > > >
> > > > On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> > > > > On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <pet...@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin wrote:
> > > > > > > On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <pet...@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> > > > > > > > > This effort was started to reduce the guest-visible downtime caused by virtio-net/vhost-net/vhost-vDPA during live migration, especially vhost-vDPA.
> > > > > > > > >
> > > > > > > > > The downtime contributed by vhost-vDPA, for example, does not come from having to migrate a lot of state, but rather from expensive backend control-plane latency like CVQ configuration (e.g. MQ queue pairs, RSS, MAC/VLAN filters, offload settings, MTU, etc.). Doing this requires kernel/HW NIC operations, which dominate its downtime.
> > > > > > > > >
> > > > > > > > > In other words, by migrating the state of virtio-net early (before the stop-and-copy phase), we can also start staging backend configurations, which is the main contributor to downtime when migrating a vhost-vDPA device.
> > > > > > > > >
> > > > > > > > > I apologize if this series gives the impression that we're migrating a lot of data here. It's more along the lines of moving control-plane latency out of the stop-and-copy phase.
> > > > > > > >
> > > > > > > > I see, thanks.
> > > > > > > >
> > > > > > > > Please add these into the cover letter of the next post. IMHO it's extremely important information to explain the real goal of this work. I bet it is not what most people expect when reading the current cover letter.
> > > > > > > >
> > > > > > > > Then it could have nothing to do with the iterative phase, am I right?
> > > > > > > >
> > > > > > > > What data does the dest QEMU need in order to start staging backend configurations to the HWs underneath? Does dest QEMU already have it on the cmdline?
> > > > > > > >
> > > > > > > > Asking this because I want to know whether it can be done completely without src QEMU at all, e.g. when dest QEMU starts.
> > > > > > > >
> > > > > > > > If src QEMU's data is still needed, please also first consider providing such a facility using an "early VMSD" if it is ever possible: feel free to refer to commit 3b95a71b22827d26178.
> > > > > > >
> > > > > > > While it works for this series, it does not allow resending the state when the src device changes. For example, if the number of virtqueues is modified.
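(For readers not familiar with the "early VMSD" facility Peter refers to: as far as I understand, it is the .early_setup flag of VMStateDescription, which lets a section be sent during the setup phase, before RAM; virtio-mem uses it for its immutable properties. A minimal, untested sketch of what that could look like for virtio-net, where the section name and the migrated field are purely illustrative:)

    /* Untested sketch, not part of this series.  An "early" section is a
     * normal VMSD with .early_setup = true, registered with
     * vmstate_register() at realize time (as virtio-mem does), so it is
     * sent before RAM and before any iterative state. */
    static const VMStateDescription vmstate_virtio_net_early = {
        .name = "virtio-net/early-cfg",              /* hypothetical name */
        .version_id = 1,
        .minimum_version_id = 1,
        .early_setup = true,                         /* send at setup time */
        .fields = (const VMStateField[]) {
            VMSTATE_UINT16(curr_queue_pairs, VirtIONet), /* illustrative */
            VMSTATE_END_OF_LIST()
        },
    };

(The limitation discussed here still applies: such a section is sent once at setup, so there is no way to resend it if the src device changes afterwards.)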
> > > > > >
> > > > > > Some explanation of "how syncing the number of vqueues helps downtime" would help. Not "it might preheat things", but exactly why, and how that differs when it's pure software and when hardware is involved.
> > > > >
> > > > > According to Nvidia engineers, configuring the vqs (number, size, RSS, etc.) takes about ~200ms:
> > > > > https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/
> > > > >
> > > > > Adding Dragos here in case he can provide more details. Maybe the numbers have changed though.
> > > > >
> > > > > And I guess the difference with pure SW will always come down to PCI communication, which I assume is slower than configuring the host SW device in RAM or even CPU cache. But I admit that proper profiling is needed before making those claims.
> > > > >
> > > > > Jonah, can you print the time it takes to configure the vDPA device with traces vs the time it takes to enable the dataplane of the device? So we can get an idea of how much time we save with this.
> > > >
> > > > Let me know if this isn't what you're looking for.
> > > >
> > > > I'm assuming by "configuration time" you mean:
> > > >  - Time from device startup (entry to vhost_vdpa_dev_start()) to right before we start enabling the vrings (e.g. VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> > > >
> > > > And by "time taken to enable the dataplane" I'm assuming you mean:
> > > >  - Time right before we start enabling the vrings (see above) to right after we enable the last vring (at the end of vhost_vdpa_net_cvq_load()).
> > > >
> > > > Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> > > >
> > > >   -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,queues=8,x-svq=on
> > > >
> > > >   -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,ctrl_vlan=off,vectors=18,host_mtu=9000,disable-legacy=on,disable-modern=off
> > > >
> > > > ---
> > > >
> > > > Configuration time: ~31s
> > > > Dataplane enable time: ~0.14ms
> > >
> > > I was vague, but yes, that's representative enough! It would be more accurate if the configuration time ended at the point where QEMU enables the first queue of the dataplane, though.
> > >
> > > As Si-Wei mentions, is v->shared->listener_registered == true at the beginning of vhost_vdpa_dev_start?
> >
> > Ah, I also realized that the QEMU I was using for the measurements was from a version before the listener_registered member was introduced.
> >
> > I retested with the latest changes in QEMU and set x-svq=off, e.g. guest specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs. I ran the test 3 times for measurements.
> >
> > v->shared->listener_registered == false at the beginning of vhost_vdpa_dev_start().
> >
> > ---
> >
> > Configuration time: Time from first entry into vhost_vdpa_dev_start() to right after QEMU enables the first VQ.
> >  - 26.947s, 26.606s, 27.326s
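(To make the measurement concrete for anyone who wants to reproduce it: something along the lines of the rough, untested fragments below is what I had in mind, i.e. two timestamps around the window Jonah describes. Only vhost_vdpa_dev_start() and VHOST_VDPA_SET_VRING_ENABLE are real names; the variable and the log message are made up for illustration.)

    /* Untested fragments, for illustration only. */
    static gint64 vdpa_cfg_start_us;

    /* In hw/virtio/vhost-vdpa.c:vhost_vdpa_dev_start(), on the first
     * start of the device: */
    if (started && vdpa_cfg_start_us == 0) {
        vdpa_cfg_start_us = g_get_monotonic_time();
    }

    /* Right after the first vring is enabled (the first
     * VHOST_VDPA_SET_VRING_ENABLE that succeeds): */
    qemu_log("vdpa configuration took %" PRId64 " ms\n",
             (g_get_monotonic_time() - vdpa_cfg_start_us) / 1000);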
>
> It's surprising to know it takes 20+ seconds for one device to load.
>
> Sorry, I'm not familiar with CVQ, please bear with me on my ignorance: how much does CVQ=on contribute to this? Is page pinning involved here? Is 128GB using small pages only?
>
CVQ=on is enabled just so we can enable multiqueue, as the HW device configuration time seems ~linear with the number of queues.

> It looks to me that vDPA can still face many of the same challenges that VFIO already had. For example, there's current work optimizing pinning for VFIO here:
>
> https://lore.kernel.org/all/20250814064714.56485-1-lizhe...@bytedance.com/
>
> For the long term, not sure if (for either VFIO or vDPA, or similar devices that need guest pinning) it would make more sense to start using 1G huge pages just for the sake of fast pinning.
>
> PFNMAP in VFIO already works with 1G pfnmaps as of commit eb996eec783c. Logically, if we could use 1G pages (e.g. on x86_64) for the guest, then pinning / unpinning can also be easily batched, and DMA pinning should be much faster. The same logic may also apply to vDPA if it works in a similar way.
>
> The work above was still generic, but I mentioned the idea of optimizing for 1G huge pages here:
>
> https://lore.kernel.org/all/aC3z_gUxJbY1_JP7@x1.local/#t
>
> Above is just FYI.. definitely not a request to work on that. So if we can split the issue into multiple smaller scopes of work, it would be nicer.

I agree. QEMU master is already able to do the memory pinning before the downtime, so let's profile that way.

> The "iterable migratable virtio-net" might just hide too many things under the hood.
>
> >
> > Enable dataplane: Time from right after the first VQ is enabled to right after the last VQ is enabled.
> >  - 0.081ms, 0.081ms, 0.079ms
>
> The other thing that might be worth mentioning.. from the migration perspective, VFIO introduced a feature called switchover-ack:
>
> # @switchover-ack: If enabled, migration will not stop the source VM
> #     and complete the migration until an ACK is received from the
> #     destination that it's OK to do so.  Exactly when this ACK is
> #     sent depends on the migrated devices that use this feature.  For
> #     example, a device can use it to make sure some of its data is
> #     sent and loaded in the destination before doing switchover.
> #     This can reduce downtime if devices that support this capability
> #     are present.  'return-path' capability must be enabled to use
> #     it.  (since 8.1)
>
> If the above 20+ seconds are not avoidable, not sure if virtio-net would like to opt in to this feature too, so that switchover won't happen too soon during a premature preheat and won't be accounted into downtime.
>
> Again, just FYI. I'm not sure if it's applicable.

Yes it is, my first versions used it :). As you said, maybe we need to use it here, so it is worth not missing it!
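(For reference, the destination-side shape of that feature, as VFIO uses it: a device opts in through the .switchover_ack_needed hook of its SaveVMHandlers and later calls qemu_loadvm_approve_switchover() once it is ready. Those two interfaces exist today, to the best of my knowledge; the vhost-vDPA/virtio-net wiring below is only a hypothetical, untested sketch of how this series could plug into it.)

    /* Untested sketch: opting in to switchover-ack on the destination.
     * Handler and struct names are hypothetical. */
    static bool vdpa_net_switchover_ack_needed(void *opaque)
    {
        /* Tell the migration core that we will send an ACK before the
         * source is allowed to switch over. */
        return true;
    }

    static const SaveVMHandlers savevm_vdpa_net_handlers = {
        .switchover_ack_needed = vdpa_net_switchover_ack_needed,
        /* ... the usual load/save handlers ... */
    };

    /* Later, once the early virtio-net state has been applied to the HW
     * device (CVQ commands, vring setup, etc.):
     *
     *     qemu_loadvm_approve_switchover();
     */

(And IIRC both the 'switchover-ack' and 'return-path' capabilities have to be enabled on both sides for this to kick in.)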