On Tue, 12 Aug 2025 12:07:56 +0100,
Coiby Xu <c...@redhat.com> wrote:
>
> On Tue, Aug 12, 2025 at 11:17:04AM +0100, Marc Zyngier wrote:
> > On Tue, 12 Aug 2025 11:09:12 +0100,
> > Coiby Xu <c...@redhat.com> wrote:
> >>
> >> On Mon, Aug 11, 2025 at 03:52:04PM +0100, Marc Zyngier wrote:
> >> > On Mon, 11 Aug 2025 14:03:21 +0100,
> >> > Thomas Gleixner <t...@linutronix.de> wrote:
> >> >>
> >> >> On Mon, Aug 11 2025 at 15:02, Thomas Gleixner wrote:
> >> >>
> >> >> CC+ Marc
> >> >>
> >> >> > On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
> >> >> >> Recently I met an issue that on certain virtual machines, the kdump
> >> >> >> kernel fails to get DHCP IP address most of times starting from
> >> >> >> 6.11-rc2. git bisection shows commit b5712bf89b4b
> >> >> >> ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]") is the
> >> >> >> 1st bad commit,
> >> >> >>
> >> >> >> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide MSI_FLAG_PCI_MSI_MASK_PARENT
> >> >> >> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
> >> >> >> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its: Provide MSI parent infrastructure
> >> >> >> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
> >> >> >> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib: Prepare for PCI MSI/MSIX
> >> >> >> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
> >> >> >> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296] irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
> >> >> >
> >> >> > There were follow up fixes on this, so isolating this one is not
> >> >> > really conclusive.
> >> >> >
> >> >> > Is the problem still there on v6.16 and v6.17-rc1?
> >> >
> >> > Yeah, there are way too many things that have been addressed since.
> >> > kdump is also a particularly nasty case, as it tends to rely on the
> >> > redistributor tables programmed by the previous kernel.
> >>
> >> Thanks for providing a clue. This may also explain why I fail
> >> to reproduce this issue against 1st kernel even with the same cmdline of
> >> the kdump kernel.
> >
> > I'm not sure that's a clue. It's only an indication that things are
> > not necessarily easy to spot.
> >
> > Has it ever been reproduced on bare metal? Have you tried v6.16 as
> > instructed?
>
> Thanks for replying so quickly!
>
> No, I haven't reproduced it on a bare metal machine and our QE engineers
> haven't noticed this issue on any bare metal machine either.
> And I can confirm this issue still happens to 6.16.0-200.fc42.aarch64
> and 6.17.0-0.rc1.17.fc43.aarch64 on the type of KVM VMs (QEMU PnP device
> PNP0c02) where the issue was found.

What is that device? Is that the emulated PCI bridge?

> >> > Also, this says "virtual machines". What's the hypervisor?
> >>
> >> I'll contact the lab administrator. What kinds of info I should collect
> >> to help you narrow down the issue?
> >
> > Surely you know what hypervisor you're running on, right?
>
> Yes, the hypervisor is KVM. Sorry, I thought merely providing the
> hypervisor info isn't sufficient and also misunderstood your request as
> providing more details on the host machine.

Well, knowing that it is KVM is definitely relevant, given that this
is my own turf.

> >> > How hard is it to reproduce?
> >>
> >> It can be reproduced reliably on certain machines. But as of writing I
> >> haven't reproduced it on other KVM virtual machines on three different
> >> host machines.
> >
> > Which machines? I'm sorry, but if you want help on this, you'll have
> > to provide actual information.
>
> Sorry, I didn't mean to be vague. I thought your question is on how
> reproducible this issue is and there is no need to provide the details
> on the machines where I can't reproduce this issue. Since you explicitly
> request it, I'll be glad to share the details.
>
> I just grabbed three arbitrary bare metal machines having Fedora-42
> installed and launched some KVM VMs to see if this issue can be
> reproduced easily. Two host machines are as follows (sorry I can't find
> the info of the 3rd one)
> - GIGABYTE PnP device PNP0c02, ARMv8 (M128-30)
> - LTHPCSR112 (01234567890123456789AB), ARMv8 (Q80-30)

Are these both Ampere Altra boxes?

> The virtual machine image is downloaded from
> https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/aarch64/images/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2.
> I tried different vCPUs (2, 4), different RAM (4G, 35G) and also two
> different UEFI firmware (the default one and one from edk2-experimental
> package) but haven't reproduced this issue so far.

Hold on.
Above, you say that you have reproduced it with
6.16.0-200.fc42.aarch64. So have you, or have you not reproduced it?

Can you at the very least share:

- the boot log of the guest on its first kernel
- the boot log of the guest running kdump
- the content of /sys/kernel/debug/kvm/$PID-xx/vgic*state* when running
  both kernels
- the QEMU command-line to get to run the whole thing

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
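
For the vgic state item in the list above, one possible way to take the two
snapshots on the host is sketched below. This is only an illustration, not
something posted in the thread: the function name, the label argument and the
output file naming are hypothetical, and it assumes debugfs is mounted at
/sys/kernel/debug, that the script runs as root, and that the per-VM
directory is named after the QEMU PID followed by a dash (the "$PID-xx"
above).

```python
import glob
import os


def collect_vgic_state(pid, out_dir, label, debugfs_root="/sys/kernel/debug/kvm"):
    """Copy every vgic*state* file of the VM owned by `pid` into `out_dir`.

    `label` is a hypothetical tag for the capture, e.g. "first-kernel" or
    "kdump". Returns the list of files written (empty if nothing matched).
    """
    os.makedirs(out_dir, exist_ok=True)
    # Mirrors the /sys/kernel/debug/kvm/$PID-xx/vgic*state* glob.
    pattern = os.path.join(debugfs_root, f"{pid}-*", "vgic*state*")
    written = []
    for src in sorted(glob.glob(pattern)):
        vm_dir = os.path.basename(os.path.dirname(src))
        dst = os.path.join(out_dir, f"{label}.{vm_dir}.{os.path.basename(src)}")
        # Read the content explicitly: debugfs files are generated on read,
        # so their reported size is not meaningful.
        with open(src, "r") as fin, open(dst, "w") as fout:
            fout.write(fin.read())
        written.append(dst)
    return written
```

It would be called once while the guest runs its first kernel and once after
the kdump kernel has booted, for example
collect_vgic_state(qemu_pid, "vgic-dumps", "first-kernel") and then
collect_vgic_state(qemu_pid, "vgic-dumps", "kdump"), so that the two dumps can
be compared side by side.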