On Mon, Aug 18, 2025 at 6:21 PM Peter Xu <pet...@redhat.com> wrote:
>
> On Mon, Aug 18, 2025 at 10:46:00AM -0400, Jonah Palmer wrote:
> >
> >
> > On 8/18/25 2:51 AM, Eugenio Perez Martin wrote:
> > > On Fri, Aug 15, 2025 at 4:50 PM Jonah Palmer <jonah.pal...@oracle.com> 
> > > wrote:
> > > >
> > > >
> > > >
> > > > On 8/14/25 5:28 AM, Eugenio Perez Martin wrote:
> > > > > On Wed, Aug 13, 2025 at 4:06 PM Peter Xu <pet...@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Aug 13, 2025 at 11:25:00AM +0200, Eugenio Perez Martin 
> > > > > > wrote:
> > > > > > > On Mon, Aug 11, 2025 at 11:56 PM Peter Xu <pet...@redhat.com> 
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 11, 2025 at 05:26:05PM -0400, Jonah Palmer wrote:
> > > > > > > > > This effort was started to reduce the guest visible downtime
> > > > > > > > > by virtio-net/vhost-net/vhost-vDPA during live migration,
> > > > > > > > > especially vhost-vDPA.
> > > > > > > > >
> > > > > > > > > The downtime contributed by vhost-vDPA, for example, is not
> > > > > > > > > from having to migrate a lot of state but rather from
> > > > > > > > > expensive backend control-plane latency, like CVQ
> > > > > > > > > configurations (e.g. MQ queue pairs, RSS, MAC/VLAN filters,
> > > > > > > > > offload settings, MTU, etc.). Applying these requires
> > > > > > > > > kernel/HW NIC operations, which dominate its downtime.
> > > > > > > > >
> > > > > > > > > In other words, by migrating the state of virtio-net early
> > > > > > > > > (before the stop-and-copy phase), we can also start staging
> > > > > > > > > backend configurations, which are the main contributor to
> > > > > > > > > downtime when migrating a vhost-vDPA device.
> > > > > > > > >
> > > > > > > > > I apologize if this series gives the impression that we're
> > > > > > > > > migrating a lot of data here. It's more along the lines of
> > > > > > > > > moving control-plane latency out of the stop-and-copy phase.
> > > > > > > >
> > > > > > > > I see, thanks.
> > > > > > > >
> > > > > > > > Please add these into the cover letter of the next post.  IMHO
> > > > > > > > it's extremely important information for explaining the real
> > > > > > > > goal of this work.  I bet most people would not expect it when
> > > > > > > > reading the current cover letter.
> > > > > > > >
> > > > > > > > Then it could have nothing to do with the iterative phase, am
> > > > > > > > I right?
> > > > > > > >
> > > > > > > > What are the data needed for the dest QEMU to start staging
> > > > > > > > backend configurations to the HWs underneath?  Does dest QEMU
> > > > > > > > already have them in the cmdlines?
> > > > > > > >
> > > > > > > > Asking this because I want to know whether it can be done
> > > > > > > > completely without src QEMU at all, e.g. when dest QEMU starts.
> > > > > > > >
> > > > > > > > If src QEMU's data is still needed, please also first consider
> > > > > > > > providing such a facility using an "early VMSD" if it is ever
> > > > > > > > possible: feel free to refer to commit 3b95a71b22827d26178.
> > > > > > > >
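For readers following along, the "early VMSD" Peter refers to looks roughly
like the sketch below, modeled on the virtio-mem precedent from that commit.
The state struct and field names here are only illustrative; the relevant
bit is .early_setup, which makes the section be sent and loaded during
migration setup rather than at the end:

#include "migration/vmstate.h"

/* Illustrative early state; not an actual virtio-net structure. */
typedef struct VirtIONetEarlyState {
    uint16_t curr_queue_pairs;
} VirtIONetEarlyState;

static const VMStateDescription vmstate_virtio_net_early = {
    .name = "virtio-net-device/early",
    .version_id = 1,
    .minimum_version_id = 1,
    .early_setup = true,    /* migrate this section in the setup phase */
    .fields = (const VMStateField[]) {
        VMSTATE_UINT16(curr_queue_pairs, VirtIONetEarlyState),
        VMSTATE_END_OF_LIST()
    },
};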
> > > > > > >
> > > > > > > While it works for this series, it does not allow resending the
> > > > > > > state when the src device changes, for example if the number of
> > > > > > > virtqueues is modified.
> > > > > >
> > > > > > Some explanation on "how syncing the number of vqueues helps
> > > > > > downtime" would help.  Not "it might preheat things", but exactly
> > > > > > why, and how that differs when it's pure software vs. when hardware
> > > > > > is involved.
> > > > > >
> > > > >
> > > > > According to NVIDIA engineers, configuring the vqs (number, size,
> > > > > RSS, etc.) takes about 200ms:
> > > > > https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/
> > > > >
> > > > > Adding Dragos here in case he can provide more details. Maybe the
> > > > > numbers have changed though.
> > > > >
> > > > > And I guess the difference from pure SW will always come down to PCI
> > > > > communication, which I assume is slower than configuring the host SW
> > > > > device in RAM or even CPU cache. But I admit that proper profiling is
> > > > > needed before making those claims.
> > > > >
> > > > > Jonah, can you print the time it takes to configure the vDPA device
> > > > > with traces vs the time it takes to enable the dataplane of the
> > > > > device, so we can get an idea of how much time we save with this?
> > > > >
> > > >
> > > > Let me know if this isn't what you're looking for.
> > > >
> > > > I'm assuming by "configuration time" you mean:
> > > >    - Time from device startup (entry to vhost_vdpa_dev_start()) to right
> > > >      before we start enabling the vrings (e.g.
> > > >      VHOST_VDPA_SET_VRING_ENABLE in vhost_vdpa_net_cvq_load()).
> > > >
> > > > And by "time taken to enable the dataplane" I'm assuming you mean:
> > > >    - Time right before we start enabling the vrings (see above) to right
> > > >      after we enable the last vring (at the end of
> > > >      vhost_vdpa_net_cvq_load())
> > > >
> > > > Guest specs: 128G Mem, SVQ=on, CVQ=on, 8 queue pairs:
> > > >
> > > > -netdev type=vhost-vdpa,vhostdev=$VHOST_VDPA_0,id=vhost-vdpa0,
> > > >           queues=8,x-svq=on
> > > >
> > > > -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,bootindex=-1,
> > > >           romfile=,page-per-vq=on,mac=$VF1_MAC,ctrl_vq=on,mq=on,
> > > >           ctrl_vlan=off,vectors=18,host_mtu=9000,
> > > >           disable-legacy=on,disable-modern=off
> > > >
> > > > ---
> > > >
> > > > Configuration time:    ~31s
> > > > Dataplane enable time: ~0.14ms
> > > >
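(For context on how such numbers can be collected: below is a minimal sketch
of the kind of ad-hoc timing that could bracket the two windows defined
above, using GLib's monotonic clock. The function and variable names are
made up for illustration; proper instrumentation would rather go through
QEMU trace events.)

#include <glib.h>
#include <stdio.h>

static gint64 cfg_start_us;
static gint64 cfg_end_us;       /* also the start of dataplane enabling */
static gint64 dataplane_end_us;

/* hook on first entry to vhost_vdpa_dev_start() */
static void mark_config_start(void)
{
    cfg_start_us = g_get_monotonic_time();
}

/* hook right before the first VHOST_VDPA_SET_VRING_ENABLE */
static void mark_dataplane_start(void)
{
    cfg_end_us = g_get_monotonic_time();
}

/* hook right after the last vring is enabled */
static void mark_dataplane_end(void)
{
    dataplane_end_us = g_get_monotonic_time();
    fprintf(stderr, "vdpa config: %.3f s, dataplane enable: %.3f ms\n",
            (cfg_end_us - cfg_start_us) / 1e6,
            (dataplane_end_us - cfg_end_us) / 1e3);
}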
> > >
> > > I was vague, but yes, that's representative enough! It would be more
> > > accurate if the configuration time ended when QEMU enables the first
> > > queue of the dataplane, though.
> > >
> > > As Si-Wei mentions, is v->shared->listener_registered == true at the
> > > beginning of vhost_vdpa_dev_start?
> > >
> >
> > Ah, I also realized that the QEMU build I was using for measurements
> > predates the introduction of the listener_registered member.
> >
> > I retested with the latest changes in QEMU and with x-svq=off (guest
> > specs: 128G Mem, SVQ=off, CVQ=on, 8 queue pairs). I ran the test 3 times
> > for measurements.
> >
> > v->shared->listener_registered == false at the beginning of
> > vhost_vdpa_dev_start().
> >
> > ---
> >
> > Configuration time: Time from first entry into vhost_vdpa_dev_start() to
> > right after QEMU enables the first VQ.
> >  - 26.947s, 26.606s, 27.326s
>
> It's surprising to know it takes 20+ seconds for one device to load.
>
> Sorry, I'm not familiar with CVQ, please bear with me on my ignorance: how
> much does CVQ=on contribute to this?  Is page pinning involved here?  Is
> 128GB using small pages only?
>

CVQ=on is enabled just so we can enable multiqueue, as the HW device
configuration time seems to scale roughly linearly with the number of
queues.
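
(For context, the multiqueue setting itself is a single command on the
control virtqueue. A simplified sketch of what that command looks like, per
the virtio-net spec headers, is below; cvq_send() is a hypothetical
transport helper, and if I recall correctly the real QEMU path for
vhost-vDPA is vhost_vdpa_net_load_mq().)

#include <stdint.h>
#include <stddef.h>
#include <linux/virtio_net.h>

/* hypothetical helper: queue one out buffer and one ack buffer on CVQ */
int cvq_send(const void *out, size_t out_len, uint8_t *ack);

static int set_queue_pairs(uint16_t pairs)
{
    struct {
        struct virtio_net_ctrl_hdr hdr;
        struct virtio_net_ctrl_mq mq;
    } __attribute__((packed)) cmd = {
        .hdr.class = VIRTIO_NET_CTRL_MQ,
        .hdr.cmd = VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET,
        .mq.virtqueue_pairs = pairs,   /* LE on the wire; swap omitted here */
    };
    uint8_t ack = VIRTIO_NET_ERR;

    if (cvq_send(&cmd, sizeof(cmd), &ack) < 0) {
        return -1;
    }
    return ack == VIRTIO_NET_OK ? 0 : -1;
}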

> It looks to me like vDPA will still face many of the same challenges that
> VFIO already had.  For example, there's current work on optimizing pinning
> for VFIO here:
>
> https://lore.kernel.org/all/20250814064714.56485-1-lizhe...@bytedance.com/
>
> For the long term, I'm not sure if (for either VFIO or vDPA, or similar
> devices that need guest pinning) it would make more sense to start using
> 1G huge pages just for the sake of fast pinning.
>
> PFNMAP in VFIO already works with 1G pfnmaps as of commit eb996eec783c.
> Logically, if we could use 1G pages (e.g. on x86_64) for the guest, then
> pinning / unpinning can also be easily batched, and DMA pinning should be
> much faster.  The same logic may also apply to vDPA if it works in a
> similar way.
>
> The work above was still generic, but I mentioned the idea of optimizing
> for 1G huge pages here:
>
> https://lore.kernel.org/all/aC3z_gUxJbY1_JP7@x1.local/#t
>
> The above is just FYI.. definitely not a request to work on that.  So it
> would be nicer if we can better split the issue into multiple smaller
> scopes of work.

I agree. QEMU master is already able to do the memory pinning before
the downtime, so let's profile that way.
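
(To make "pinning before the downtime" concrete: below is a rough paraphrase,
from memory and not verbatim QEMU code, of the listener_registered logic in
hw/virtio/vhost-vdpa.c that Si-Wei mentioned. Field, parameter and header
names are approximate; the point is only that when the listener is already
registered before dev_start, the DMA map + pin work has been paid for
outside the stop-and-copy window.)

#include "hw/virtio/vhost-vdpa.h"   /* approximate; check the real tree */
#include "exec/memory.h"

static void vdpa_start_pinning_sketch(VhostVDPAShared *shared,
                                      AddressSpace *dma_as)
{
    if (!shared->listener_registered) {
        /* slow path: map and pin all guest memory now, inside downtime */
        memory_listener_register(&shared->listener, dma_as);
        shared->listener_registered = true;
    }
    /* otherwise: pinning already happened earlier, outside downtime */
}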

> The "iterable migratable virtio-net" might just hide too
> many things under the hood.
>
> >
> > Enable dataplane: Time from right after the first VQ is enabled to right
> > after the last VQ is enabled.
> >  - 0.081ms, 0.081ms, 0.079ms
> >
>
> The other thing that might be worth mentioning: from a migration
> perspective, VFIO introduced a feature called switchover-ack:
>
> # @switchover-ack: If enabled, migration will not stop the source VM
> #     and complete the migration until an ACK is received from the
> #     destination that it's OK to do so.  Exactly when this ACK is
> #     sent depends on the migrated devices that use this feature.  For
> #     example, a device can use it to make sure some of its data is
> #     sent and loaded in the destination before doing switchover.
> #     This can reduce downtime if devices that support this capability
> #     are present.  'return-path' capability must be enabled to use
> #     it.  (since 8.1)
>
> If the above 20+ seconds are not avoidable, I'm not sure if virtio-net
> would like to opt in to this feature too, so that switchover won't happen
> too early during a premature preheat and that time won't be accounted into
> downtime.
>
> Again, just FYI. I'm not sure if it's applicable.
>

Yes it is, my first versions used it :). As you said, maybe we need to use
it here too, so it is worth not missing it!
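
(For reference, a sketch of what opting in could look like, assuming
virtio-net reused the same hooks VFIO added for this feature, i.e. a
switchover_ack_needed callback in its SaveVMHandlers plus a call to
qemu_loadvm_approve_switchover() on the destination once the early state
has been applied to the backend. This is only the shape of it, not a
proposal or real QEMU code, and the headers/field names are from memory;
enabling it also needs the "switchover-ack" and "return-path" migration
capabilities set on both sides.)

#include "migration/register.h"   /* approximate; check the VFIO code */
#include "migration/savevm.h"

static bool virtio_net_switchover_ack_needed(void *opaque)
{
    /* only hold off switchover when a backend that needs preheating is in
     * use, e.g. vhost-vDPA */
    return true;
}

static int virtio_net_apply_early_state(void *opaque)
{
    /* ... push MQ/RSS/filter/MTU config to the backend here ... */

    /* tell the source it is now OK to stop the VM and switch over */
    return qemu_loadvm_approve_switchover();
}

static const SaveVMHandlers savevm_virtio_net_handlers = {
    .switchover_ack_needed = virtio_net_switchover_ack_needed,
    /* save/load handlers omitted in this sketch */
};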

