Re: [RFC PATCH] virtio-net: introduce strict peer feature check

Jason Wang Sun, 16 Nov 2025 20:33:14 -0800

On Sun, Nov 16, 2025 at 2:53 PM Michael S. Tsirkin <[email protected]> wrote:
>
> On Fri, Nov 14, 2025 at 09:32:47AM +0800, Jason Wang wrote:
> > On Fri, Nov 14, 2025 at 1:47 AM Michael S. Tsirkin <[email protected]> wrote:
> > >
> > > On Thu, Nov 13, 2025 at 12:12:38PM -0500, Peter Xu wrote:
> > > > On Thu, Nov 13, 2025 at 11:47:51AM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Nov 13, 2025 at 11:37:25AM -0500, Peter Xu wrote:
> > > > > > On Thu, Nov 13, 2025 at 11:09:32AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Fri, Nov 07, 2025 at 10:01:49AM +0800, Jason Wang wrote:
> > > > > > > > We used to clear features silently in virtio_net_get_features() even
> > > > > > > > if it is required. This complicates the live migration compatibility
> > > > > > > > as the management layer may think the feature is enabled but in fact
> > > > > > > > not.
> > > > > > > >
> > > > > > > > Let's add a strict feature check to make sure if there's a mismatch
> > > > > > > > between the required feature and peer, fail the get_features()
> > > > > > > > immediately instead of waiting until the migration to fail. This
> > > > > > > > offloads the migration compatibility completely to the management
> > > > > > > > layer.
> > > > > > > >
> > > > > > > > Signed-off-by: Jason Wang <[email protected]>
> > > > > > >
> > > > > > > This is not really useful - how do users know how to tweak their
> > > > > > > command lines?
> > > > > > > We discussed this many times.
> > > > > > > To try and solve this you need a tool that will tell you how to start
> > > > > > > VM on X to make it migrateable to Y or Z.
> > > > > > >
> > > > > > >
> > > > > > > More importantly,
> > > > > > > migration is a niche thing and breaking booting perfectly good VMs
> > > > > > > just for that seems wrong.
> > > > > >
> > > > > > IMHO Jason's proposal is useful in that it now provides a way to provide
> > > > > > ABI stability but allows auto-ON to exist.
> > > > > >
> > > > > > If we think migration is optional, we could add a migration blocker where
> > > > > > strict check flag is set to OFF, as I mentioned in the email reply to Dan.
> > > > > > As that implies the VM ABI is not guaranteed.
> > > > > >
> > > > > > Thanks,
> > > > >
> > > > >
> > > > > All you have to do is avoid changing the kernel and ABI is stable.
> > > > > Downstreams already do this.
> > > >
> > > > But the whole point of migration is allowing VMs to move between hosts..
> > > > hence AFAIU kernel can change.
> > > >
> > > > Downstream will still have problem if some network features will be
> > > > optionally supported in some of the RHEL-N branches, because machine types
> > > > are defined the same in any RHEL-N, so IIUC it's also possible a VM booting
> > > > on a latest RHEL-X.Y qemu/kernel hit issues migrating back to an older
> > > > RHEL-X.(Y-1) qemu/kernel if RHEL-X.(Y-1) kernel doesn't have the network
> > > > feature available..
> > > >
> > > > It's also not good IMHO to only fix downstream but having upstream face
> > > > such problems, even if there's a downstream fix...
> > > >
> > > > This thread was revived only because Jinpu hit similar issues.  IMHO we
> > > > should still try to provide a generic solution upstream for everyone.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > failing to start a perfectly good qemu which used to work
> > > because you changed kernels is better than failing to migrate how?
> >
> > It doesn't:
> >
> > 1) the strict feature check will be only enabled in new machine types
> > 2) if kernel ABI is stable, qemu will keep working after upgrading
> > kernel even with strict check otherwise it would be a bug of kernel
> >
> > So I don't see it breaking anything if we make it start to work at 11.0?
>
> Using QEMU from git suddenly requires upgrading the kernel or figuring
> out obscure flags? Ugh.


Only setups that are already buggy might hit this.

>
>
> > >
> > >
> > >
> > > graceful downgrade with old kernels is the basics of good userspace
> > > behaviour and has been for decades.
> >
> > Peter has given the example of how hard we can define gracefulness
> > (e.g migrate from a kernel w/ USO to a kernel w/o USO) and fix.
> >
> > Maybe we can think of a userspace fallback to emulation of USO or
> > others, but I'm not sure if it's an overkill.
> >
> > >
> > >
> > > sure, let's work on a solution, just erroring out is more about blaming
> > > the user. what is the user supposed to do when qemu fails to start?
> >
> > It's the first step as it's much better than silently clearing the
> > feature which may confuse both user and migration. We can use warnings
> > instead of errors but I'm not sure how much it can help.
>
>
> Well with this first step we have successfully blamed the user and
> the second step won't ever be taken.

Are you suggesting to fix the management? E.g patching libvirt to
probe tap features?

>
> > >
> > >
> > > first, formulate what exactly do you want to enable.
> > >
> > >
> > >
> > > for example, you have a set of boxes and you want a set of flags
> > > to supply to guarantee qemu can migrate between them. is that it?
> >
> > Mostly, it should work as a CPU cluster.
>
> the reason it kinda works with CPU cluster is simply because
> there is a final set of CPU models and you can not easily
> switch your CPU to a different model.

We can define a set of TAP features as well, but I'm not sure it's
worthwhile to do this.
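
For reference, a minimal sketch of what a defined TAP feature set plus a
strict check could look like. Everything here is hypothetical: the FEAT_*
bit values, the TAP_BASELINE_* names and strict_peer_check() are made up
for illustration; they are neither the real virtio-net bit numbers nor
existing QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative feature bits only; NOT the real virtio-net bit numbers. */
#define FEAT_CSUM (UINT64_C(1) << 0)
#define FEAT_TSO4 (UINT64_C(1) << 1)
#define FEAT_TSO6 (UINT64_C(1) << 2)
#define FEAT_USO  (UINT64_C(1) << 3)

/* Hypothetical named baselines, loosely analogous to CPU models. */
#define TAP_BASELINE_V1 (FEAT_CSUM | FEAT_TSO4 | FEAT_TSO6)
#define TAP_BASELINE_V2 (TAP_BASELINE_V1 | FEAT_USO)

/*
 * Strict check: succeed only if the peer supports every required bit.
 * *missing always receives the required bits the peer lacks, so the
 * caller can report them instead of clearing them silently.
 */
static bool strict_peer_check(uint64_t required, uint64_t peer,
                              uint64_t *missing)
{
    assert(missing != NULL);
    *missing = required & ~peer;
    return *missing == 0;
}
```

With this, a peer that only offers TAP_BASELINE_V1 fails a request for
TAP_BASELINE_V2, and the missing mask names FEAT_USO as the reason.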

>
> > So it's the responsibility of
> > the management layer, maybe we can develop some tool to report this or
> > via qemu introspection ("query-tap" ?). Or if the management can do
> > this now, we don't even need to bother (or it can help to uncover
> > bugs). Anyhow, clearing a feature silently is not good and can cover
> > bugs of various layers.
> >
> > Note that this issue is not specific to TAP, we may meet this for
> > vDPA/VFIO live migration as well. Basically, it should be the
> > responsibility of the management layer to deal with those migration
> > compatibility policies instead of using hard coded policies inside
> > Qemu. For qemu, it can simply error out when there's a mismatch
> > between features that are supported and features that are asked to
> > enable. We've suffered a lot in the past when trying to deal with this
> > by Qemu.
> >
> > Thanks
>
> Yes but QEMU currently gives management no tools to figure out
> what is important for it.

Using Qemu might be problematic as it usually doesn't have the privilege.

We can extend iproute, write a dedicated tool, or ask libvirt to do this.
If libvirt could do the probe by itself, could we start from that?
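
As a rough sketch of what such a probe would feed into: once the per-host
feature masks are known, the set that is safe to enable across a
migration pool is just their intersection. The function name and the
mask layout below are assumptions for illustration; how the masks are
actually probed is not covered here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the feature bits supported by every host in the pool.
 * hosts[i] is the (hypothetical) probed feature mask of host i.
 * An empty pool yields 0, i.e. nothing is known to be safe.
 */
static uint64_t common_features(const uint64_t *hosts, size_t n)
{
    uint64_t common = n ? UINT64_MAX : 0;

    for (size_t i = 0; i < n; i++) {
        common &= hosts[i];
    }
    return common;
}
```

The management layer would then request exactly that mask on every host,
so a strict check in Qemu would only trip on a real inconsistency.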

Thanks


>
>
>
> > >
> > >
> > >
> > > --
> > > MST
> > >
>
