Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Sathyanarayanan Kuppuswamy Wed, 21 May 2025 09:17:45 -0700


On 5/21/25 7:54 AM, Hans Zhang wrote:



On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:


On 5/19/25 7:41 AM, Hans Zhang wrote:



On 2025/5/19 22:21, Hans Zhang wrote:



On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:


On 5/16/25 9:55 AM, Hans Zhang wrote:

The following series introduces a new kernel command-line option aer_panic
to enhance error handling for PCIe Advanced Error Reporting (AER) in
mission-critical environments. This feature ensures deterministic recovery
from fatal PCIe errors by triggering a controlled kernel panic when device
recovery fails, avoiding indefinite system hangs.


Why would a device recovery failure lead to a system hang? In the worst case,
that device may not be accessible, right? Is there a real use case?



Dear Sathyanarayanan,

With Synopsys and Cadence PCIe IP, the AER interrupts are usually SPI
interrupts, not INTx/MSI/MSI-X interrupts. (Some customers design them as
MSI/MSI-X interrupts, e.g. RK3588, but not all customers do.) For example,
when many Qualcomm mobile phone SoCs handle an AER interrupt and the link is
down, that is, a fatal error has occurred on the PCIe physical link, the
system cannot recover. At that point, a system restart is needed to solve
the problem.

Our company's SoC design, http://radxa.com/products/orion/o6/, has 5 PCIe
ports and has the same problem. If one of the PCIe ports fails, the entire
system hangs. So I hope Linux can offer an option that lets SoC manufacturers
choose to restart the system when a fatal hardware error occurs in PCIe.

There are also products such as mobile phones and tablets. We don't want to 
wait until the battery is completely used up before restarting them.

For the specific code of Qualcomm, please refer to the email I sent.



Dear Sathyanarayanan,

Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
    /* Clear AXI link-down status */
    cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If there has been a link down in this PCIe port, the register 
CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to continue.  
This is different from Synopsys.

If CPU Core0 is just reaching code L52 while CPU Core1 is saving files to an
NVMe SSD, the CDNS_PCIE_AT_LINKDOWN register is still 1, so CPU Core1 cannot
send TLPs and hangs. This is a very extreme situation.
(The current upstream Cadence code is for the legacy PCIe IP; the HPA IP is
still being upstreamed.)

Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/


It sounds like a system-level issue to me. Why not rely on a watchdog to
reboot in this case?


Dear Sathyanarayanan,

Thank you for your reply. Yes, personally, I also think it is a system-level
problem. I ran a local test: when I unplugged the EP device from the slot,
the system hung. I reproduced this many times. Since we don't have a bus
timeout response mechanism for PCIe, it hangs easily.


Any comment on why watchdog is not used to reboot the unresponsive system?
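
For reference, the watchdog approach asked about above can be sketched from user space with the standard Linux watchdog character device interface (WDIOC_SETTIMEOUT and WDIOC_KEEPALIVE from <linux/watchdog.h>). This is only a minimal sketch: the helper names wdt_arm() and wdt_feed() are made up for illustration, and whether the watchdog actually fires when a core stalls on a PCIe access depends on the platform.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a watchdog device (typically /dev/watchdog)
 * and request a timeout in seconds. Returns the open fd, or -1 on failure.
 * Once the device is open, the hardware resets the system unless
 * wdt_feed() is called again before the timeout expires.
 */
int wdt_arm(const char *path, int timeout_sec)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_sec) < 0) {
		/* Depending on the driver, close() may not disarm the timer. */
		close(fd);
		return -1;
	}
	return fd;
}

/* Hypothetical helper: feed the watchdog. A hung system stops calling it. */
int wdt_feed(int fd)
{
	return ioctl(fd, WDIOC_KEEPALIVE, 0);
}
```

A daemon would call wdt_arm("/dev/watchdog", 10) once and then wdt_feed() in a loop at an interval shorter than the timeout; if the system hangs, the feeding stops and the hardware forces the reboot.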


Even if you want to add this support, I think it is more appropriate to add
it to your specific PCIe controller driver. I don't see why you want to add
it as part of the generic AER driver.

Because we want to use the processing logic of the general AER driver. If the 
recovery is successful, there will be no problem. If the recovery fails, my 
original intention was to restart the system.

Adding it to the specific PCIe controller driver would mean duplicating a lot
of AER processing logic. So I was wondering whether the AER driver could be
changed so that it can be built as a loadable module (.ko).


Maybe you can rely on the error handler callbacks to get notified of fatal
errors, or you can even use a uevent handler to detect the disconnected
device event and handle it there.
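
The uevent suggestion can be sketched in user space as follows. A uevent payload read from a NETLINK_KOBJECT_UEVENT socket is a sequence of NUL-terminated "KEY=VALUE" strings. The sketch assumes that a failed PCI error recovery is reported as a change event carrying ERROR_EVENT=FAILED_RECOVERY (my understanding of pci_uevent_ers(); please verify against your kernel tree). The helper names uevent_get() and uevent_is_failed_recovery() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: look up `key` in a uevent payload of `len` bytes
 * made of NUL-terminated "KEY=VALUE" strings. Returns a pointer to VALUE,
 * or NULL if the key is absent. Entries without "key=" at the start
 * (such as the "action@devpath" header) are skipped.
 */
const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen = strlen(key);
	size_t off = 0;

	while (off < len) {
		const char *entry = buf + off;
		const char *nul = memchr(entry, '\0', len - off);
		size_t elen;

		if (!nul)	/* unterminated tail: stop parsing */
			break;
		elen = (size_t)(nul - entry);
		if (elen > klen && entry[klen] == '=' &&
		    !strncmp(entry, key, klen))
			return entry + klen + 1;
		off += elen + 1;
	}
	return NULL;
}

/* True when the event reports that PCI error recovery failed (assumed key). */
bool uevent_is_failed_recovery(const char *buf, size_t len)
{
	const char *ev = uevent_get(buf, len, "ERROR_EVENT");

	return ev && !strcmp(ev, "FAILED_RECOVERY");
}
```

A platform daemon could call uevent_is_failed_recovery() on every message it receives and trigger its own policy (for example an orderly reboot) without any new kernel parameter.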



If this series is not reasonable, I'll drop it.


Adding a new kernel parameter to solve a specific system issue is not
recommended. Try to find a custom solution for your chip/controller.



Best regards,
Hans

Problem Statement
In systems where unresolved PCIe errors (e.g., bus hangs) occur,
traditional error recovery mechanisms may leave the system unresponsive
indefinitely. This is unacceptable for high-availability environments
requiring prompt recovery via reboot.

Solution
The aer_panic option forces a kernel panic on unrecoverable AER errors.
This bypasses prolonged recovery attempts and ensures immediate reboot.

Patch Summary:
Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
its purpose and usage.

Command-Line Handling: Implements pci=aer_panic parsing and state
management in PCI core.

State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
mode is active.

Panic Trigger: Modifies recovery logic to panic the system when recovery
fails and aer_panic is enabled.
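
The flow summarized above (parse the option, expose the state, panic only when recovery fails) can be modeled in plain user-space C. This is a sketch of the logic, not the code from the patches: every model_* name and the enum are hypothetical, and only the option string "aer_panic" and the idea of pci_aer_panic_enabled() come from the series.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's recovery status values. */
enum model_ers_result {
	MODEL_ERS_RECOVERED,
	MODEL_ERS_DISCONNECT,
};

/* Models the state that "pci=aer_panic" would set. */
static bool model_aer_panic;

/* Parse a comma-separated "pci=" value, e.g. "noaer,aer_panic". */
void model_pci_setup(const char *str)
{
	model_aer_panic = false;
	while (str && *str) {
		const char *comma = strchr(str, ',');
		size_t len = comma ? (size_t)(comma - str) : strlen(str);

		/* Exact match only, so "aer_panic_extra" is not accepted. */
		if (len == strlen("aer_panic") &&
		    !strncmp(str, "aer_panic", len))
			model_aer_panic = true;
		str = comma ? comma + 1 : NULL;
	}
}

/* Models pci_aer_panic_enabled() from the series. */
bool model_aer_panic_enabled(void)
{
	return model_aer_panic;
}

/* Panic only when the option is set and recovery did not succeed. */
bool model_should_panic(enum model_ers_result status)
{
	return model_aer_panic_enabled() && status != MODEL_ERS_RECOVERED;
}
```

With model_pci_setup("noaer,aer_panic"), model_should_panic(MODEL_ERS_DISCONNECT) is true while a successful recovery never panics; without the option, behavior is unchanged.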

Impact
Controlled Recovery: Reduces downtime by replacing hangs with immediate
reboots.

Optional: Enabled via pci=aer_panic; no default behavior change.

Dependency: Requires CONFIG_PCIEAER.

For example, in mobile phones and tablets, when the PCIe link fails and
cannot be restored, there should be a way to make the system panic instead
of waiting for the battery to be completely exhausted before the system
restarts.

---
For example, the sm8250 and sm8350 of qcom will panic and restart the
system when the link goes down.

https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950

Because each SoC manufacturer's design is different, the AXI and other buses
that PCIe connects to may not be designed to prevent hangs. Once a FATAL
error occurs in the PCIe link and the link cannot be restored, the system
needs to be restarted.

Dear Mani,

I wonder if you know how other qcom SoCs handle FATAL errors that occur
in the PCIe link.
---

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
