On Wed, Sep 3, 2025 at 10:38 AM Christian Ehrhardt <
christian.ehrha...@canonical.com> wrote:

> On Wed, Aug 20, 2025 at 7:11 AM Christian Ehrhardt
> <christian.ehrha...@canonical.com> wrote:
> >
> > On Tue, Aug 19, 2025 at 4:51 PM Paolo Bonzini <pbonz...@redhat.com>
> wrote:
> > >
> > > On 8/6/25 21:18, Daniel P. Berrangé wrote:
> > > > On Wed, Aug 06, 2025 at 07:57:34PM +0200, Christian Ehrhardt wrote:
> > > >> On Wed, Aug 6, 2025 at 2:00 PM Daniel P. Berrangé <
> berra...@redhat.com> wrote:
> > > >>>
> > > >>> On Wed, Aug 06, 2025 at 01:52:17PM +0200, Christian Ehrhardt wrote:
> > > >>>> Hi,
> > > >>>> I was unsure if this would be better sent to libvirt or qemu - the
> > > >>>> issue is somewhere between libvirt modelling CPUs and qemu 10.1
> > > >>>> behaving differently. I did not want to double post and gladly
> most of
> > > >>>> the people are on both lists - since the switch in/out of the
> problem
> > > >>>> is qemu 10.0 <-> 10.1 let me start here. I beg your pardon for
> not yet
> > > >>>> having all the answers, I'm sure I could find more with
> debugging, but
> > > >>>> I also wanted to report early for your awareness while we are
> still in
> > > >>>> the RC phase.
> > > >>>>
> > > >>>>
> > > >>>> # Problem
> > > >>>>
> > > >>>> What I found when testing migrations in Ubuntu with qemu 10.1-rc1
> was:
> > > >>>>    error: operation failed: guest CPU doesn't match specification:
> > > >>>> missing features: pdcm
> > > >>>>
> > > >>>> This is behaving the same with libvirt 11.4 or the more recent
> 11.6.
> > > >>>> But switching back to qemu 10.0 confirmed that this behavior is
> new
> > > >>>> with qemu 10.1-rc.
> > > >>>
> > > >>>
> > > >>>> Without yet having any hard evidence against them I found a few
> pdcm
> > > >>>> related commits between 10.0 and 10.1-rc1:
> > > >>>>    7ff24fb65 i386/tdx: Don't mask off CPUID_EXT_PDCM
> > > >>>>    00268e000 i386/cpu: Warn about why CPUID_EXT_PDCM is not
> available
> > > >>>>    e68ec2980 i386/cpu: Move adjustment of CPUID_EXT_PDCM before
> > > >>>> feature_dependencies[] check
> > > >>>>    0ba06e46d i386/tdx: Add TDX fixed1 bits to supported CPUIDs
> > > >>>>
> > > >>>>
> > > >>>> # Caveat
> > > >>>>
> > > >>>> My test environment is in LXD system containers, that gives me
> issues
> > > >>>> in the power management detection
> > > >>>>    libvirtd[406]: error from service:
> GDBus.Error:System.Error.EROFS:
> > > >>>> Read-only file system
> > > >>>>    libvirtd[406]: Failed to get host power management capabilities
> > > >>>
> > > >>> That's harmless.
> > > >>
> > > >> Yeah, it always was for me - thanks for confirming.
> > > >>
> > > >>>> And the resulting host-model on a  rather old test server will
> therefore have:
> > > >>>>    <cpu mode='custom' match='exact' check='full'>
> > > >>>>      <model fallback='forbid'>Haswell-noTSX-IBRS</model>
> > > >>>>      <vendor>Intel</vendor>
> > > >>>>      <feature policy='require' name='vmx'/>
> > > >>>>      <feature policy='disable' name='pdcm'/>
> > > >>>>       ...
> > > >>>>
> > > >>>> But that was fine in the past, and the behavior started to break
> > > >>>> save/restore or migrations just now with the new qemu 10.1-rc.
> > > >>>>
> > > >>>> # Next steps
> > > >>>>
> > > >>>> I'm soon overwhelmed by meetings for the rest of the day, but
> would be
> > > >>>> curious if one has a suggestion about what to look at next for
> > > >>>> debugging or a theory about what might go wrong. If nothing else
> comes
> > > >>>> up I'll try to set up a bisect run tomorrow.
> > > >>>
> > > >>> Yeah, git bisect is what I'd start with.
> > > >>
> > > >> Bisect complete, identified this commit
> > > >>
> > > >> commit 00268e00027459abede448662f8794d78eb4b0a4
> > > >> Author: Xiaoyao Li <xiaoyao...@intel.com>
> > > >> Date:   Tue Mar 4 00:24:50 2025 -0500
> > > >>
> > > >>      i386/cpu: Warn about why CPUID_EXT_PDCM is not available
> > > >>
> > > >>      When user requests PDCM explicitly via "+pdcm" without PMU
> enabled, emit
> > > >>      a warning to inform the user.
> > > >>
> > > >>      Signed-off-by: Xiaoyao Li <xiaoyao...@intel.com>
> > > >>      Reviewed-by: Zhao Liu <zhao1....@intel.com>
> > > >>      Link:
> https://lore.kernel.org/r/20250304052450.465445-3-xiaoyao...@intel.com
> > > >>      Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> > > >>
> > > >>   target/i386/cpu.c | 3 +++
> > > >>   1 file changed, 3 insertions(+)
> > > >>
> > > >>
> > > >>
> > > >> Which is odd as it should only add a warning right?
> > > >
> > > > No, that commit message is misleading.
> > > >
> > > > IIUC mark_unavailable_features() actively blocks usage of the
> feature,
> > > > so it is a functional change, not merely a emitting warning.
> > > >
> > > > It makes me wonder if that commit was actually intended to block the
> > > > feature or not, vs merely warning ?  CC'ing those involved in the
> > > > commit.
> > > We can revert the commit.  I'll send the revert to Stefan and let him
> > > decide whether to include it in 10.1-rc4 or delay to 10.2 and 10.1.1.
> >
> > Thanks Paolo for considering that.
> >
> > My steps to reproduce seemed really clear and are 100% reproducible
> > for me, but no one so far said "yeah they see it too", so I'm getting
> > unsure if it was not tried by anyone else or if there is more to it
> > than we yet know.
> > Further I tested more with the commit reverted, and found that at
> > least cross version migrations (9.2 -> 10.1) still have issues that
> > seem related - complaining about pdcm as missing feature.
> > But that was in a log of a test system that went away and ... you know
> > how these things can sometimes be, that new result is not yet very
> > reliable.
> >
> > I intended to check the following matrix more deeply again with and
> > without the reverted change and then come back to this thread:
> >
> > #1 Compare platforms
> > - Migrating between non containerized hosts to verify if they are
> > affected as well
> > - Power management explicitly switched off/on (vs the auto detect of
> > host-model) in the guest XML
> > #2 Retest the different Use-cases I've seen this pop up
> > - 10.1 managed save (broken unless reverting the commit that was
> identified)
> > - 9.2 -> 10.1 migration (seems broken even with the revert)
>
> I need to come back to this aspect of it - the cross release or cross
> qemu version migrations.
>
> Hector (on CC) helps me on that now - sadly we were able to confirm
> that migrations from older qemu versions no longer work.
> Yep 10.1 is released by now so it might end up as "The problem is what
> happens when we detect after we have done a release that something has
> gone wrong" from [2].
> But I still can't believe only we see this and therefore for now want
> to believe I messed up on our side when merging 10.1 :-)
>
> For now this is a call if others have also seen any older release
> migrating to 10.1 to throw:
>   error: operation failed: guest CPU doesn't match specification:
> missing features: pdcm,arch-capabilities
>
> Hector will later today reply here with a summary of what we found so
> far, to provide you a more complete picture to think about, without
> having to read through all the messy interim steps in the Ubuntu bug.
>
>
Indeed, we hit this error when migrating from older QEMU versions to
QEMU 10.1:


$ virsh migrate --unsafe --live test-migration qemu+ssh://10.105.100.188/system
error: operation failed: guest CPU doesn't match specification:
missing features: pdcm,arch-capabilities

The domain definition used to reproduce this issue is quite simple:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>test-migration</name>
  <memory unit='GiB'>2</memory>
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
  </os>
  <cpu mode='host-model' check='partial'/>
</domain>

Here are the -machine and -cpu parts of the QEMU command lines for a
migration between QEMU 9.2 and QEMU 10.1 on an Intel Haswell CPU:

Origin: QEMU 9.2

...
-machine pc-q35-9.2,usb=off,dump-guest-core=off,memory-backend=pc.ram,acpi=off \
-cpu Haswell-noTSX-IBRS,vmx=on,pdcm=on,f16c=on,rdrand=on,
 hypervisor=on,vme=on,ss=on,arat=on,tsc-adjust=on,zero-fcs-fds=on,
 umip=on,md-clear=on,stibp=on,flush-l1d=on,arch-capabilities=on,
 ssbd=on,xsaveopt=on,abm=on,pdpe1gb=on,ibpb=on,ibrs=on,amd-stibp=on,
 amd-ssbd=on,skip-l1dfl-vmentry=on,pschange-mc-no=on,gds-no=on,rfds-no=on,
<vmx-*>
...

<vmx-*> : the vmx-* block is omitted for clarity

Target: QEMU 10.1

...
-machine pc-q35-9.2,usb=off,dump-guest-core=off,memory-backend=pc.ram,acpi=off \
-cpu Haswell-noTSX-IBRS,vmx=on,pdcm=on,f16c=on,rdrand=on,
 hypervisor=on,vme=on,ss=on,arat=on,tsc-adjust=on,zero-fcs-fds=on,
 umip=on,md-clear=on,stibp=on,flush-l1d=on,arch-capabilities=on,
 ssbd=on,xsaveopt=on,abm=on,pdpe1gb=on,ibpb=on,ibrs=on,amd-stibp=on,
 amd-ssbd=on,skip-l1dfl-vmentry=on,pschange-mc-no=on,gds-no=on,rfds-no=on,
 vmx-activity-wait-sipi=on,vmx-pml=on
...
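
Both command lines above request pdcm=on and arch-capabilities=on, so the
difference is not in what libvirt asks for. To double-check that on longer
-cpu strings, a small Python helper along these lines can diff them. This
is only a sketch: the function names parse_cpu_arg and diff_cpu_args are
made up for illustration, the strings in the example at the bottom are
shortened, made-up inputs (not our real command lines), and treating a
bare feature name as "on" is an assumption of this sketch.

```python
def parse_cpu_arg(cpu_arg):
    """Split a QEMU -cpu value into (model, {feature: value})."""
    # Strip whitespace and newlines introduced by wrapping the command line.
    parts = [p.strip() for p in cpu_arg.split(",") if p.strip()]
    model, features = parts[0], {}
    for part in parts[1:]:
        name, _, value = part.partition("=")
        # Assumption for this sketch: a bare name counts as "on".
        features[name] = value or "on"
    return model, features


def diff_cpu_args(origin, target):
    """Return {feature: (origin_value, target_value)} for differing features."""
    _, src = parse_cpu_arg(origin)
    _, dst = parse_cpu_arg(target)
    return {
        name: (src.get(name), dst.get(name))
        for name in sorted(set(src) | set(dst))
        if src.get(name) != dst.get(name)
    }


if __name__ == "__main__":
    # Hypothetical, shortened inputs just to show the output format.
    origin = "Haswell-noTSX-IBRS,vmx=on,pdcm=on,arch-capabilities=on"
    target = "Haswell-noTSX-IBRS,vmx=on,pdcm=off"
    for name, (a, b) in diff_cpu_args(origin, target).items():
        print(f"{name}: origin={a} target={b}")
```

A feature missing on one side shows up with None for that side.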

This migration failure breaks down into two separate issues, one per
missing feature: pdcm and arch-capabilities.
As far as we currently understand, QEMU's behavior for both features
changed in 10.1.

- arch-capabilities

  https://github.com/qemu/qemu/commit/d3a24134e37d57abd3e7445842cda2717f49e96d
  (target/i386: do not expose ARCH_CAPABILITIES on AMD CPU)

- pdcm

  https://github.com/qemu/qemu/commit/e68ec2980901c8e7f948f3305770962806c53f0b
  (i386/cpu: Move adjustment of CPUID_EXT_PDCM before feature_dependencies[] check)

  This commit makes QEMU disable pdcm when the PMU is off. I think this
  was already the intended behavior in previous QEMU versions, but a bug
  got in the way, and this commit fixes that bug.
  When I enable the PMU in the guest definition:
    <features>
      <pmu state='on'/>
    </features>
  the missing pdcm feature error disappears.
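
For anyone who wants to try the PMU workaround on an existing guest
definition, here is a rough Python sketch that adds the <pmu state='on'/>
element to a domain XML string. The helper name enable_pmu is made up for
illustration, and it assumes a plain libvirt domain document; it does not
validate anything against the libvirt schema.

```python
import xml.etree.ElementTree as ET

# Keep the qemu: prefix if the domain uses libvirt's QEMU namespace.
ET.register_namespace("qemu", "http://libvirt.org/schemas/domain/qemu/1.0")


def enable_pmu(domain_xml):
    """Return the domain XML with <features><pmu state='on'/></features>."""
    root = ET.fromstring(domain_xml)
    features = root.find("features")
    if features is None:
        # No <features> block yet: create one under <domain>.
        features = ET.SubElement(root, "features")
    pmu = features.find("pmu")
    if pmu is None:
        pmu = ET.SubElement(features, "pmu")
    # Overwrite any existing state (e.g. state='off').
    pmu.set("state", "on")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<domain type='kvm'><name>test-migration</name></domain>"
    print(enable_pmu(sample))
```

The input could be the output of virsh dumpxml, and the result could be
loaded again with virsh define. Note that ElementTree does not preserve
the original formatting or comments.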

If we revert both commits, the migration works.

We are looking into potential solutions to this migration issue.
According to the documentation [1], our failure might fall into the
4th case:
  $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2

Presumably some compatibility properties need to be added so that the
new behavior of pdcm and arch-capabilities stays compatible with older
QEMU versions. But as Christian said, 10.1 is already released, so this
might be more complex now.

## Other failed combinations

We checked the migration combinations across our supported releases.
We can confirm that migration to 10.1 is also broken from the other QEMU
versions we support in various Ubuntu releases:
(F = Focal, J = Jammy, N = Noble, P = Plucky, Q = Questing)

F-4.2 -> Q-10.1
J-6.2 -> Q-10.1
N-8.2 -> Q-10.1
P-9.2 -> Q-10.1

It may be worth noting that these combinations are covered by our
automated tests, which leave the cpu unspecified and let libvirt pick
safe defaults instead of using host-model as in the sample domain
definition above.
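
For automated runs like these, it helps to pull the feature names out of
the libvirt error so failures can be grouped per feature. A minimal Python
sketch, assuming the error text has the same shape as the message shown in
this mail; the function name missing_features is made up for illustration.

```python
import re


def missing_features(stderr_text):
    """Extract the feature list from a libvirt 'missing features' error."""
    # The message may be wrapped across lines, so normalize whitespace first.
    text = " ".join(stderr_text.split())
    match = re.search(r"missing features: ([\w,.-]+)", text)
    return match.group(1).split(",") if match else []


if __name__ == "__main__":
    err = (
        "error: operation failed: guest CPU doesn't match specification:\n"
        "missing features: pdcm,arch-capabilities"
    )
    print(missing_features(err))
```

An empty list means the text did not contain a "missing features" error.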

[1] https://www.qemu.org/docs/master/devel/migration/compatibility.html#how-to-mitigate-when-we-have-a-backward-compatibility-error



> [1]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2121787
> [2]:
> https://gitlab.com/qemu-project/qemu/-/blob/master/docs/devel/migration/compatibility.rst?plain=1#L322
>
> > The hope was that these will help to further identify what is going
> > on, but despite the urgency of the release being imminent I have not
> > yet managed to find the time in the last two days :-/
> >
> > > Sorry for the delay in answering (and thanks Daniel for bringing this
> to
> > > my attention).
> > >
> > > Thanks,
> > >
> > > Paolo
> > >
> >
> >
> > --
> > Christian Ehrhardt
> > Director of Engineering, Ubuntu Server
> > Canonical Ltd
>
>
>
> --
> Christian Ehrhardt
> Director of Engineering, Ubuntu Server
> Canonical Ltd
>


-- 
Hector CAO
Software Engineer – Partner Engineering Team
hector....@canonical.com
https://launchpad.net/~hectorcao