Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1

Marc Zyngier Tue, 12 Aug 2025 09:31:01 -0700

On Tue, 12 Aug 2025 12:07:56 +0100,
Coiby Xu <c...@redhat.com> wrote:
> 
> On Tue, Aug 12, 2025 at 11:17:04AM +0100, Marc Zyngier wrote:
> > On Tue, 12 Aug 2025 11:09:12 +0100,
> > Coiby Xu <c...@redhat.com> wrote:
> >> 
> >> On Mon, Aug 11, 2025 at 03:52:04PM +0100, Marc Zyngier wrote:
> >> > On Mon, 11 Aug 2025 14:03:21 +0100,
> >> > Thomas Gleixner <t...@linutronix.de> wrote:
> >> >>
> >> >> On Mon, Aug 11 2025 at 15:02, Thomas Gleixner wrote:
> >> >>
> >> >> CC+ Marc
> >> >>
> >> >> > On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
> >> >> >> Recently I met an issue that on certain virtual machines, the kdump
> >> >> >> kernel fails to get DHCP IP address most of the time starting from
> >> >> >> 6.11-rc2. git bisection shows commit b5712bf89b4b
> >> >> >> ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]") is the
> >> >> >> first bad commit:
> >> >> >>
> >> >> >>      # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide MSI_FLAG_PCI_MSI_MASK_PARENT
> >> >> >>      git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
> >> >> >>      # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its: Provide MSI parent infrastructure
> >> >> >>      git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
> >> >> >>      # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib: Prepare for PCI MSI/MSIX
> >> >> >>      git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
> >> >> >>      # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296] irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
> >> >> >
> >> >> > There were follow up fixes on this, so isolating this one is not
> >> >> > really conclusive.
> >> >> >
> >> >> > Is the problem still there on v6.16 and v6.17-rc1?
> >> >
> >> > Yeah, there are way too many things that have been addressed since.
> >> > kdump is also a particularly nasty case, as it tends to rely on the
> >> > redistributor tables programmed by the previous kernel.
> >> 
> >> Thanks for providing a clue. This may also explain why I fail to
> >> reproduce this issue against the 1st kernel even with the same cmdline
> >> as the kdump kernel.
> > 
> > I'm not sure that's a clue. It's only an indication that things are
> > not necessarily easy to spot.
> > 
> > Has it ever been reproduced on bare metal? Have you tried v6.16 as
> > instructed?
> 
> Thanks for replying so quickly!
> 
> No, I haven't reproduced it on a bare metal machine and our QE engineers
> haven't noticed this issue on any bare metal machine either. 
> And I can confirm this issue still happens on 6.16.0-200.fc42.aarch64
> and 6.17.0-0.rc1.17.fc43.aarch64 on the type of KVM VMs (QEMU PnP device
> PNP0c02) where the issue was found.


What is that device? Is that the emulated PCI bridge?

> >> > Also, this says "virtual machines". What's the hypervisor?
> >> 
> >> I'll contact the lab administrator. What kind of info should I collect
> >> to help you narrow down the issue?
> > 
> > Surely you know what hypervisor you're running on, right?
> 
> Yes, the hypervisor is KVM. Sorry, I thought merely providing the
> hypervisor info wasn't sufficient and also misunderstood your request as
> asking for more details on the host machine.

Well, knowing that it is KVM is definitely relevant, given that this
is my own turf.

> >> > How hard is it to reproduce?
> >> 
> >> It can be reproduced reliably on certain machines. But as of writing I
> >> haven't reproduced it on other KVM virtual machines on three different
> >> host machines.
> > 
> > Which machines? I'm sorry, but if you want help on this, you'll have
> > to provide actual information.
> 
> Sorry, I didn't mean to be vague. I thought your question was about how
> reproducible this issue is and that there was no need to provide details
> on the machines where I can't reproduce it. Since you explicitly
> request it, I'll be glad to share the details.
> 
> I just grabbed three arbitrary bare metal machines having Fedora-42
> installed and launched some KVM VMs to see if this issue can be
> reproduced easily. Two host machines are as follows (sorry I can't find
> the info of the 3rd one)
> - GIGABYTE PnP device PNP0c02, ARMv8 (M128-30)
> - LTHPCSR112 (01234567890123456789AB), ARMv8 (Q80-30)

Are these both Ampere Altra boxes?

> The virtual machine image is downloaded from
> https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/aarch64/images/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2.
> I tried different vCPUs (2, 4), different RAM (4G, 35G) and also two
> different UEFI firmware (the default one and one from edk2-experimental
> package) but haven't reproduced this issue so far.

Hold on. Above, you say that you have reproduced it with
6.16.0-200.fc42.aarch64. So have you, or have you not reproduced it?

Can you at the very least share:

- the boot log of the guest on its first kernel

- the boot log of the guest running kdump

- the content of /sys/kernel/debug/kvm/$PID-xx/vgic*state* when
  running both kernels

- the QEMU command line used to run the whole thing
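
For the vgic state item, a minimal sketch of how the files could be
captured on the host is shown below. The helper name, its arguments and
the output file naming are hypothetical and only for illustration; the
debugfs path is the one given above, and the optional fourth argument
exists only so the helper can be pointed at a different directory.

```shell
# Hypothetical helper (not from this thread): copy every file matching
# <root>/<pid>-*/vgic*state* into <outdir>, prefixed with a label such
# as "first" or "kdump" so the two captures can be told apart.
collect_vgic_state() {
    pid="$1"      # PID of the QEMU process running the guest
    label="$2"    # free-form tag for this capture
    outdir="$3"   # directory receiving the copies
    root="${4:-/sys/kernel/debug/kvm}"

    mkdir -p "$outdir" || return 1
    found=0
    for f in "$root/$pid"-*/vgic*state*; do
        # Skip the literal pattern when the glob matches nothing.
        [ -f "$f" ] || continue
        vmdir=$(basename "$(dirname "$f")")
        # Use cat rather than cp: debugfs files usually report size 0.
        cat "$f" > "$outdir/$label-$vmdir-$(basename "$f")" || return 1
        found=1
    done
    # Fail if nothing matched, so an empty capture is noticed.
    [ "$found" -eq 1 ]
}
```

This would be run once while the guest is on its first kernel and once
more after it has switched to the kdump kernel, for example
`collect_vgic_state "$PID" first ./vgic` and then
`collect_vgic_state "$PID" kdump ./vgic`. It assumes debugfs is mounted
on the host and that the caller has permission to read it.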

Thanks,

        M.

-- 
Without deviation from the norm, progress is not possible.
