On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
> If a new AER error is subsequently reported, the AER driver calls
> find_source_device() to find the source of the error. It rescans the
> whole bus and picks the first device reporting an AER error. Because the
> previous error was never cleared, the error is attributed to the wrong
> device and AER recovery is started for the wrong device.

I guess the rationale for the current behavior is that once error
recovery has failed, the affected devices are basically in a broken
state, so user intervention is required, e.g. a remove/rescan via
sysfs.
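
For reference, a remove/rescan via sysfs might look roughly like the
sketch below.  The function name pci_remove_rescan and the SYSFS_ROOT
override are made up for illustration (the override only exists so the
function can be dry-run against a fake tree), and the device address is
a placeholder, not one taken from this report.  It needs root on a real
system.

```shell
#!/bin/sh
# Sketch: remove one PCI device via sysfs, then rescan all PCI buses.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.

pci_remove_rescan() {
    bdf="$1"                          # e.g. 0000:01:00.0 (placeholder)
    root="${SYSFS_ROOT:-/sys}"
    dev="$root/bus/pci/devices/$bdf"

    # Bail out if the device (or its "remove" attribute) is not present.
    [ -e "$dev/remove" ] || { echo "no such device: $bdf" >&2; return 1; }

    echo 1 > "$dev/remove" || return 1    # unbind the driver, drop the device
    echo 1 > "$root/bus/pci/rescan"       # re-enumerate all PCI buses
}
```

Usage would be something like "pci_remove_rescan 0000:01:00.0", with the
address replaced by that of the device left broken by the failed
recovery.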

My question is, why is error recovery failing for the devices
in the first place?

And what does the hierarchy look like?
(lspci -tv and lspci -vvv output please)

I also don't quite follow your assertion that (only) the first device
reporting an error is picked.  The algorithm tries to collect *all*
error-reporting devices in the affected portion of the hierarchy.

Thanks,

Lukas
