Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure

Yury M. Mon, 18 May 2026 13:50:19 -0700

Current behavior has existed for a long time and I could easily imagine that there is software which relies on the fact that the system is in a non-modified state if AER recovery failed. The software can analyze the system and do cleanup afterwards. Sometimes, if something fails in the system, it is better to have it in a non-modified state.

In short, I just wanted to preserve the current logic by default because there is a chance that we have software which relies on the current behavior.


On 5/18/26 21:29, Bjorn Helgaas wrote:

[+cc Lukas]


On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:

pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
If a new AER error is subsequently reported, the AER driver calls
find_source_device() to find the source of the error. It rescans the
whole bus and picks the first device reporting an AER error. Because the
previous error was never cleared, the error is attributed to the wrong
device and AER recovery is started for the wrong device.

Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
AER error status even when recovery fails, preventing stale errors from
causing incorrect device identification on subsequent AER events.
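
For illustration only, the misattribution described above can be reproduced with a toy user-space model. This is not kernel code: the names toy_dev, toy_find_source() and toy_second_error_source() are hypothetical, a single boolean stands in for a device's AER status, and the scan is assumed to return the first device in bus order with status set, as the description says.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NDEV 3

/* Hypothetical stand-in for a PCI device: one sticky error-status flag. */
struct toy_dev {
	bool err_status;
};

/* Scan in bus order; return the index of the first device whose status
 * is set, or -1 if no device reports an error. */
static int toy_find_source(const struct toy_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].err_status)
			return (int)i;
	return -1;
}

/* Device 0 reports an error and its recovery is assumed to fail; then
 * device 2 reports an unrelated error. Returns the index of the device
 * that the second scan blames. */
static int toy_second_error_source(bool clear_on_failure)
{
	struct toy_dev devs[NDEV] = { { false }, { false }, { false } };
	int first;

	devs[0].err_status = true;		/* first error */
	first = toy_find_source(devs, NDEV);
	assert(first == 0);

	/* Recovery for 'first' fails. Status is cleared only if requested,
	 * modelling the proposed opt-in behavior. */
	if (clear_on_failure)
		devs[first].err_status = false;

	devs[2].err_status = true;		/* second error */
	return toy_find_source(devs, NDEV);
}
```

In this model, toy_second_error_source(false) blames device 0 for the second error because its stale status is still set, while toy_second_error_source(true) finds device 2.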

Why should we add a kernel parameter for this?  How would a user
decide whether to use the parameter?  Are there cases where we
find the source of the first error, but we *wouldn't* want to clear
it if recovery fails?
