On 5/20/26 09:43, Lukas Wunner wrote:
No, is_error_source() considers bus number 0 as a bogus number and will iterate over all devices on the bus.
Right, if the bus number in the source ID is 0, is_error_source() accepts any device that has an AER error logged, so it can return true for any device on the bus. But once is_error_source() returns true for the first device on the bus, find_device_iter() stops iterating. It stops whenever e_info->multi_error_valid == 0, which is our case. So the error is reported only for the first device on the bus.
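To make the iteration behavior above concrete, here is a minimal user-space sketch. It is not the kernel code: every model_* name is hypothetical, the AER status check is reduced to a single boolean, and details such as PCI_BUS_FLAGS_NO_AERSID are left out. It only models the two decisions discussed here, the bus-0 fallback and the early stop when multi_error_valid is 0.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct pci_dev / struct aer_err_info. */
struct model_dev {
	unsigned int id;	/* bus number in bits 15:8, devfn in bits 7:0 */
	bool aer_status;	/* true if the device has an AER error logged */
};

#define MODEL_MAX_ERROR_DEVS 8

struct model_err_info {
	unsigned int id;	/* source ID reported by the Root Port */
	bool multi_error_valid;
	size_t error_dev_num;
	const struct model_dev *dev[MODEL_MAX_ERROR_DEVS];
};

#define MODEL_BUS_NUM(id) (((id) >> 8) & 0xff)

/* Models is_error_source(): a source ID with bus 0 is treated as bogus. */
static bool model_is_error_source(const struct model_dev *dev,
				  const struct model_err_info *info)
{
	if (MODEL_BUS_NUM(info->id) != 0) {
		if (info->id == dev->id)
			return true;
		if (!info->multi_error_valid)
			return false;
	}
	/* Bogus ID or multiple errors: fall back to the device's own status. */
	return dev->aer_status;
}

/* Models find_device_iter(): returns 1 to stop the bus walk, 0 to continue. */
static int model_find_device_iter(const struct model_dev *dev,
				  struct model_err_info *info)
{
	if (model_is_error_source(dev, info)) {
		if (info->error_dev_num >= MODEL_MAX_ERROR_DEVS)
			return 1;	/* no room to record more devices */
		info->dev[info->error_dev_num++] = dev;
		if (!info->multi_error_valid)
			return 1;	/* single error: stop at the first match */
	}
	return 0;
}

/* Walks all devices on one bus and returns how many were recorded. */
static size_t model_walk_bus(const struct model_dev *devs, size_t n,
			     struct model_err_info *info)
{
	for (size_t i = 0; i < n; i++)
		if (model_find_device_iter(&devs[i], info))
			break;
	return info->error_dev_num;
}
```

With four devices on bus 46 that all have an error logged, a source ID of 0 and multi_error_valid == 0, model_walk_bus() records only the first device. With multi_error_valid == 1 it records all four.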
On 5/20/26 10:02, Lukas Wunner wrote:
One more thing, recovery is failing here because those four devices on bus 46 are unbound. I've got two patches under development to allow error recovery for unbound devices. You may want to try those and see if error recovery succeeds: https://github.com/l1k/linux/commits/aer_unbound
I believe that aer_unbound will fix our issue, but I need some time to confirm that it does.
But I think that we still have an issue in pcie_do_recovery(). AER recovery can still fail for other reasons, and each of those cases should be investigated and fixed. Until they are fixed, and as long as AER recovery can fail, users should not experience the kind of issues I described in this patch. To me, it looks better to clean up the AER errors even if AER recovery failed. We can log a warning saying that the AER errors were cleared although AER recovery failed.
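As a rough illustration of that proposal, here is a small user-space sketch. It does not reproduce pcie_do_recovery(): the model_* names, the single status word and the message text are all assumptions made for the example. The only point is the ordering: the status is cleared on both the success and the failure path, and the failure path logs a warning first.

```c
#include <stdio.h>

/* Hypothetical recovery outcomes, loosely modeled on pci_ers_result_t. */
enum model_result { MODEL_RECOVERED, MODEL_DISCONNECT };

/* Hypothetical stand-in for a port with latched AER status bits. */
struct model_port {
	unsigned int aer_status;	/* nonzero means stale errors are latched */
};

static void model_clear_aer_status(struct model_port *port)
{
	port->aer_status = 0;
}

/*
 * Finishes a recovery attempt. handlers_result is the combined outcome
 * reported by the (not modeled) driver callbacks.
 */
static enum model_result model_finish_recovery(struct model_port *port,
					       enum model_result handlers_result)
{
	if (handlers_result != MODEL_RECOVERED)
		fprintf(stderr,
			"model: AER recovery failed, clearing AER status anyway\n");

	/* Clear unconditionally so stale status does not linger after a failure. */
	model_clear_aer_status(port);
	return handlers_result;
}
```

The caller still sees the failure through the return value, so clearing the status does not hide the fact that recovery did not succeed.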
