pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

Michael S. Tsirkin Sat, 27 May 2023 23:41:35 -0700

On Fri, May 26, 2023 at 09:55:22AM +0200, Juan Quintela wrote:
> Jiri Denemark <jdene...@redhat.com> wrote:
> > On Thu, May 11, 2023 at 13:43:47 +0200, Juan Quintela wrote:
> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >> 
> >> [Added libvirt people to the party, see the end of the message ]
> >
> > Sorry, I'm not that much into parties :-)
> >
> >> That would fix the:
> >> 
> >> qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
> >> 
> >> Is it worth it?  Dunno.  That is my question.
> >> 
> >> And knowing which qemu it has migrated from would not help.  We
> >> would need to add a new tweak that means:
> >> 
> >> This is a pc-7.2 machine that has been instantiated in qemu-8.0 and
> >> has the PCIe AER bug.  But wait, we have _that_.
> >> 
> >> And it is called
> >> 
> >> +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
> >> 
> >> from the patch.
> >> 
> >> We can teach libvirt about this glitch: if it is migrating a pc-7.2
> >> machine running on qemu-8.0 and wants to move it to a newer qemu
> >> (call it qemu-8.1), the destination needs to be started as:
> >> 
> >> qemu-8.1 -M pc-7.2 <whatever pci devices need to do>,x-pcie-err-unc-mask="true"
> >> 
> >> until the user reboots it; then that property can be reset to its
> >> default value.
> >
> > Hmm and what would happen if eventually this machine gets migrated back
> > to qemu-8.0?
> 
> It works.
> Migrating to qemu-7.2 is what is not going to work.
> To migrate to qemu-8.0, you just need to drop the
> "x-pcie-err-unc-mask=true" bit.  And it would work.
> 
> So, to be clear, this machine can migrate to:
> 
> - qemu-8.0, you just need to drop the "x-pcie-err-unc-mask=true" bit
> 
> - qemu-8.0.1 or newer, you just need to maintain the
>   "x-pcie-err-unc-mask=true" bit.
> 
> Let's just assume that qemu-7.2.1 doesn't get the
> "x-pcie-err-unc-mask=true" bit, so it will not be able to migrate there.
> 
> 
> > Or even when the machine is stopped, started again, and
> > then migrated to qemu-8.0?
> 
> If you do what I call a hard reset (i.e. poweroff + poweron so qemu
> dies), you should drop the "x-pcie-err-unc-mask=true" bit.  And then you
> can migrate to qemu-7.2 and all qemu-8.0.1 and newer.
> 
> Basically what we need is a "mark" inside libvirt that means something
> like:
> 
> - this is a weird machine that looks like pc-7.2
> - but has "x-pcie-err-unc-mask=true"
> - so it can only migrate to qemu-8.0 and newer.
> - but if it ever reboots in qemu-8.0.1 or newer, we want it to go back
>   to being a "normal" pc-7.2 machine (i.e. drop the
>   x-pcie-err-unc-mask=true).
> 
> That would be the perfect world.  But as we are in an imperfect world,
> something like:
> 
> - this machine started in qemu-8.0 -M pc-7.2, we know this is broken and
>   it can't migrate outside of qemu-8.0 because it would fail to go to
>   either qemu-7.2 or qemu-8.0.1.
> 
> I would argue that if you implement the second option, doing the "right"
> one (i.e. the first) is not much more complicated, but that is a
> question you are better placed to answer.
> 
> And then we have Michael's other question: how can we export that
> information so libvirt can use it?
> 
> In this case we can communicate to libvirt:
> - In qemu-8.0 we broke pc-7.2.
> - The problem is fixed in qemu-8.0.1 using property
>   "x-pcie-err-unc-mask=false".
> - You can migrate from qemu-8.0 to newer versions if you set that
>   property to true.
> - Guests started in qemu-8.0 -M pc-7.2 should reboot in qemu-8.0.1 or
>   newer to become "normal pc-7.2".
> - If we publish this on qemu, we can only publish it on qemu-8.0.1 and
>   newer.
> - Or we can publish it somewhere else and any libvirt can take this
>   information.
> - Or we can communicate this to libvirt, and they incorporate it into
>   their source wherever they see fit.


And this is not an isolated instance. There are things like this in
almost every release.


My suggestion is a package with known bugs like this.
It would list these workarounds in some machine-readable
format and would be essentially append only, making it
relatively safe even for very old RHEL distros to
pick up the latest version once in a while.

E.g. the fact that we add a bug workaround for 10.0 will not affect
7.2, so you do not need to fork with each release.
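
To make the idea concrete, here is a minimal sketch of what one entry in
such a list could look like and how a management layer might consult it.
Everything about the format is an assumption made for illustration: the
field names, the JSON encoding and the helper functions are hypothetical
and are not an existing QEMU or libvirt interface. Only the property name
comes from the patch, and the value "on" is assumed as the counterpart of
the "off" used there.

```python
import json

# Hypothetical append-only registry. Field names are illustrative only.
REGISTRY_JSON = """
[
  {
    "id": "pcie-err-unc-mask",
    "machine_types": ["pc-7.2"],
    "broken_in": "8.0",
    "fixed_in": "8.0.1",
    "property": "x-pcie-err-unc-mask",
    "value_when_started_on_broken": "on"
  }
]
"""


def parse_version(text):
    """Turn "8.0" or "8.0.1" into a comparable 3-tuple such as (8, 0, 1)."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def migration_properties(registry, machine_type, started_on, destination):
    """Return the extra properties the destination needs.

    Returns None when the registry says the migration cannot work.
    """
    props = {}
    start = parse_version(started_on)
    dest = parse_version(destination)
    for bug in registry:
        broken = parse_version(bug["broken_in"])
        fixed = parse_version(bug["fixed_in"])
        affected = (machine_type in bug["machine_types"]
                    and broken <= start < fixed)
        if not affected:
            continue
        if dest < broken:
            # Older qemu has no such property, so it cannot match the guest.
            return None
        if dest >= fixed:
            # Fixed qemu must be told to reproduce the broken behaviour.
            props[bug["property"]] = bug["value_when_started_on_broken"]
        # A destination inside the broken range needs nothing extra.
    return props


registry = json.loads(REGISTRY_JSON)
```

With this entry, a pc-7.2 guest started on qemu-8.0 needs nothing extra to
reach another qemu-8.0, needs the property set to reach qemu-8.0.1 or
newer, and cannot reach qemu-7.2. The sketch does not model the reboot
case: after a hard reset on a fixed qemu the caller would record the new
version as the one the guest was started on, and the workaround no longer
applies. Appending an entry for a later release does not change the
answer for this one.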




> The point here is that when we use a property on a machine type, it can
> be for two reasons:
> 
> - We detected at the right time that we changed the value of something,
>   and we did the right thing on hw_compat_X_Y, so libvirt needs to do
>   nothing.
> 
> - We *DID NOT* detect that we broke compatibility before release, and we
>   need to make a property to identify that problem.  This is where we
>   need to do this dance.
> 
> Notice that normally we detect lots of problems during development and
> this *should* not happen.  But when it happens, we need to be able to do
> something.
> 
> And also notice that normally we break just some device, not a whole
> machine type.  But as you can see we have broken it this time.  We are
> trying to automate the detection of this kind of failure, but we are
> still at the design stage, so we need to plan how to handle this.
> 
> Any comments?
> 
> Later, Juan.
> 
> 
> 
>

Re: [PATCH v1 1/1] hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0
