The current behavior has existed for a long time, and I can easily imagine software that relies on the system being left in an unmodified state when AER recovery fails. Such software can then analyze the system and clean up afterwards. Sometimes, when something fails, it is better to leave the system unmodified. In short, I wanted to preserve the current logic by default because existing software may rely on it.

On 5/18/26 21:29, Bjorn Helgaas wrote:
[+cc Lukas]

On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
If a new AER error is subsequently reported, the AER driver calls
find_source_device() to find the source of the error. It rescans the
whole bus and picks the first device reporting an AER error. Because the
previous error was never cleared, the error is attributed to the wrong
device and AER recovery is started for the wrong device.

Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
AER error status even when recovery fails, preventing stale errors from
causing incorrect device identification on subsequent AER events.

Why should we add a kernel parameter for this?  How would a user
decide whether to use the parameter?  Are there cases where we
find the source of the first error, but we *wouldn't* want to clear
it if recovery fails?
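
For readers following the thread, the misattribution described in the quoted patch description can be sketched as a toy user-space model in C. This is not kernel code: every name here (toy_dev, toy_find_source, toy_failed_recovery, and so on) is hypothetical, the status register is reduced to a single integer, and the scan only assumes what the description says, namely that the first device with an error logged is picked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NDEVS 3

/* Hypothetical stand-in for a device's AER error status register. */
struct toy_dev {
    uint32_t aer_status;   /* non-zero means "an error is logged here" */
};

/* Mimics the scan from the description: first device with status set wins. */
static int toy_find_source(const struct toy_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].aer_status)
            return (int)i;
    return -1;
}

/* Log an error on one device. */
static void toy_report_error(struct toy_dev *devs, size_t n, size_t idx)
{
    assert(idx < n);
    devs[idx].aer_status |= 1u;
}

/*
 * Recovery that always fails. The logged status is cleared only when
 * clear_on_failure is set (the behavior the proposed parameter enables).
 */
static void toy_failed_recovery(struct toy_dev *dev, bool clear_on_failure)
{
    if (clear_on_failure)
        dev->aer_status = 0;
}

/*
 * Scenario: error on device 0, failed recovery, then a new error on
 * device 2. Returns the index the second error is attributed to.
 */
static int toy_second_error_source(bool clear_on_failure)
{
    struct toy_dev devs[TOY_NDEVS] = { {0}, {0}, {0} };
    int src;

    toy_report_error(devs, TOY_NDEVS, 0);
    src = toy_find_source(devs, TOY_NDEVS);
    assert(src == 0);
    toy_failed_recovery(&devs[src], clear_on_failure);

    toy_report_error(devs, TOY_NDEVS, 2);
    return toy_find_source(devs, TOY_NDEVS);
}
```

In this model, toy_second_error_source(false) returns 0, so the second error is blamed on the device whose stale status was never cleared, while toy_second_error_source(true) returns 2, the device that actually reported it.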
