Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Hans Zhang Mon, 19 May 2025 07:40:42 -0700



On 2025/5/19 22:21, Hans Zhang wrote:

On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
On 5/16/25 9:55 AM, Hans Zhang wrote:
The following series introduces a new kernel command-line option aer_panic
to enhance error handling for PCIe Advanced Error Reporting (AER) in
mission-critical environments. This feature ensures deterministic recovery
from fatal PCIe errors by triggering a controlled kernel panic when device
recovery fails, avoiding indefinite system hangs.
Why would a device recovery failure lead to a system hang? Worst case
that device may not be accessible, right?  Any real use case?
Dear Sathyanarayanan,
With the Synopsys and Cadence PCIe IP, AER interrupts are usually SPI interrupts, not INTx/MSI/MSI-X interrupts. (Some customers design it as an MSI/MSI-X interrupt, e.g. RK3588, but not all customers have designed it this way.) For example, when many Qualcomm mobile phone SoCs handle AER interrupts and there is a link down, that is, a fatal problem occurs in the current PCIe physical link, the system cannot recover. At this point, a system restart is needed to solve the problem.
Our company's SoC design, http://radxa.com/products/orion/o6/, has 5 PCIe ports and has the same problem. If one of the PCIe ports has a problem, the entire system hangs. So I hope Linux can offer an option that lets SoC manufacturers choose to restart the system when a fatal PCIe hardware error occurs.
There are also products such as mobile phones and tablets. We don't want to wait until the battery is completely drained before they restart.
For the specific Qualcomm code, please refer to the email I sent.


Dear Sathyanarayanan,

Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
    /* Clear AXI link-down status */
    cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If there has been a link down on this PCIe port, the register CDNS_PCIE_AT_LINKDOWN must be set to 0 for AXI transmission to continue. This is different from Synopsys.

If CPU Core0 runs to the code at L52 while CPU Core1 is saving files to an NVMe SSD, the CDNS_PCIE_AT_LINKDOWN register is still 1, so CPU Core1 cannot send TLPs and hangs. This is a very extreme situation. (The current Cadence code is for the Legacy PCIe IP; the HPA IP is still in the upstream process at present.)
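
To make the sequence above concrete, here is a toy user-space model in C. It is only a sketch under assumptions: struct toy_port and the toy_* helpers are hypothetical names, not driver code, and a blocked transfer is modeled as a false return value where real hardware would stall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one root port; not taken from the driver. */
struct toy_port {
    uint32_t at_linkdown; /* stands in for CDNS_PCIE_AT_LINKDOWN */
};

/* Hardware side: a link-down event latches the status to 1. */
static void toy_link_down(struct toy_port *p)
{
    assert(p != NULL);
    p->at_linkdown = 1;
}

/* Models the write done in cdns_pci_map_bus(): clear the status. */
static void toy_clear_linkdown(struct toy_port *p)
{
    assert(p != NULL);
    p->at_linkdown = 0;
}

/*
 * A transfer makes progress only while the status is 0.
 * Returning false here stands for "the issuing CPU would hang".
 */
static bool toy_send_tlp(const struct toy_port *p)
{
    assert(p != NULL);
    return p->at_linkdown == 0;
}
```

In this model, any toy_send_tlp() call made between toy_link_down() and toy_clear_linkdown() fails, which is the window in which Core1 would be stuck.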


Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/

Best regards,
Hans


Problem Statement
In systems where unresolved PCIe errors (e.g., bus hangs) occur,
traditional error recovery mechanisms may leave the system unresponsive
indefinitely. This is unacceptable for high-availability environments
requiring prompt recovery via reboot.

Solution
The aer_panic option forces a kernel panic on unrecoverable AER errors.
This bypasses prolonged recovery attempts and ensures immediate reboot.

Patch Summary:

Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
its purpose and usage.

Command-Line Handling: Implements pci=aer_panic parsing and state
management in PCI core.

State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
mode is active.

Panic Trigger: Modifies recovery logic to panic the system when recovery
fails and aer_panic is enabled.
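
The flow described in the summary above can be sketched in user-space C as follows. This is not the code from the patches: parse_pci_options(), aer_panic_enabled(), should_panic() and the ers_result enum are hypothetical stand-ins chosen for illustration, and the real series would call panic() inside the kernel instead of returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's recovery result codes. */
enum ers_result { ERS_RECOVERED, ERS_DISCONNECT };

/* Models the state that "pci=aer_panic" would set at boot. */
static bool aer_panic_flag;

/* Walk a comma-separated "pci=" option string, e.g. "noaer,aer_panic". */
static void parse_pci_options(const char *opts)
{
    assert(opts != NULL);
    while (*opts) {
        size_t len = strcspn(opts, ","); /* length of the current token */

        if (len == strlen("aer_panic") && !strncmp(opts, "aer_panic", len))
            aer_panic_flag = true;
        opts += len;
        if (*opts == ',')
            opts++; /* skip the separator */
    }
}

/* Models the state-exposure helper from the summary. */
static bool aer_panic_enabled(void)
{
    return aer_panic_flag;
}

/* Panic only when recovery failed and the option was given. */
static bool should_panic(enum ers_result result)
{
    return result != ERS_RECOVERED && aer_panic_enabled();
}
```

With the option absent, should_panic() stays false for every result, which matches the "no default behavior change" point in the cover letter.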

Impact
Controlled Recovery: Reduces downtime by replacing hangs with immediate
reboots.

Optional: Enabled via pci=aer_panic; no default behavior change.

Dependency: Requires CONFIG_PCIEAER.

For example, in mobile phones and tablets, when there is a problem with
the PCIe link and it cannot be restored, it is expected to provide an
alternative method to make the system panic without waiting for the
battery power to be completely exhausted before restarting the system.

---
For example, the sm8250 and sm8350 of qcom will panic and restart the
system when their PCIe link goes down.

https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950


Since the design schemes of each SOC manufacturer are different, the AXI
and other buses connected by PCIe do not have a design to prevent hanging.
Once a FATAL error occurs in the PCIe link and cannot be restored, the
system needs to be restarted.


Dear Mani,

I wonder if you know how other SoCs of qcom handle FATAL errors that occur
in the PCIe link.
---

Hans Zhang (4):
   pci: implement "pci=aer_panic"
   PCI/AER: Introduce aer_panic kernel command-line option
   PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
   PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set

  .../admin-guide/kernel-parameters.txt          |  7 +++++++
  drivers/pci/pci.c                              |  2 ++
  drivers/pci/pci.h                              |  4 ++++
  drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
  drivers/pci/pcie/err.c                         |  8 ++++++--
  5 files changed, 37 insertions(+), 2 deletions(-)


base-commit: fee3e843b309444f48157e2188efa6818bae85cf
prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
