On Tue, Jan 27, 2026 at 05:00:55PM -0600, Bjorn Helgaas wrote:
> On Sun, Jan 25, 2026 at 10:25:51AM +0100, Lukas Wunner wrote:
> > Correctable and Uncorrectable Error Status Registers on reporting agents
> > are cleared upon PCI device enumeration in pci_aer_init() to flush past
> > events.  They're cleared again when an error is handled by the AER driver.
> 
> Do you think pci_aer_init() is the right time to clear the error
> status bits?  Most of those bits are sticky, so they're not cleared by
> reset.
> 
> I'm thinking about the scenario where a PCIe error occurs and is captured
> in the AER error status registers, but the system reboots before the
> AER driver can log the error.  Since the bits are sticky, the new
> kernel might have a chance to find and log the error that happened
> with the previous kernel.

I agree that *reporting* errors instead of just silently *clearing* them
could be useful.

We cannot pinpoint when the errors occurred, so we'd have to mark them
in the log messages as having occurred "during shutdown or early boot"
or "during suspend or resume" (for errors occurring during a system sleep
cycle).  But that could still be good enough and helpful for users.

We could report them with KERN_INFO severity and if that turns out to be
too noisy, demote them to KERN_DEBUG or exempt certain error types
(such as Unsupported Requests).
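
To illustrate the policy, here is a minimal userspace sketch (not kernel
code).  The names stale_report(), stale_phase and stale_exempt_mask are
made up for illustration; only the Unsupported Request bit position
(bit 20, same value as PCI_ERR_UNC_UNSUP) is real.  In the kernel this
would presumably end up as a pci_info() call rather than snprintf().

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Bit 20 of the Uncorrectable Error Status Register: Unsupported Request.
 * Same value as PCI_ERR_UNC_UNSUP in include/uapi/linux/pci_regs.h.
 */
#define UNC_UNSUP 0x00100000u

/* Hypothetical: the window in which a stale error may have been latched. */
enum stale_phase { STALE_BOOT, STALE_SLEEP };

/* Hypothetical policy knob: status bits that are never reported. */
static const uint32_t stale_exempt_mask = UNC_UNSUP;

static const char *stale_phase_str(enum stale_phase phase)
{
	return phase == STALE_SLEEP ? "during suspend or resume"
				    : "during shutdown or early boot";
}

/*
 * Format a message for the non-exempt stale bits in @status.
 * Returns the bits that were reported, or 0 if nothing was logged
 * (in which case @buf is set to the empty string).
 */
static uint32_t stale_report(char *buf, size_t len, uint32_t status,
			     enum stale_phase phase)
{
	uint32_t report = status & ~stale_exempt_mask;

	if (!report) {
		if (len)
			buf[0] = '\0';
		return 0;
	}

	snprintf(buf, len,
		 "stale uncorrectable error status 0x%08" PRIx32 " latched %s",
		 report, stale_phase_str(phase));
	return report;
}
```

With this sketch, a status of 0x00100000 (only Unsupported Request set)
produces no message, whereas 0x00104000 is reported as 0x00004000.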

Shuai Xue and I had a discussion late last year about reporting
versus silently clearing stale errors:
https://lore.kernel.org/all/[email protected]/

I think we were both unsure back then whether you would entertain a patch
to report stale errors.  But since you're now raising the issue yourself,
I'd say yes, it's worth pursuing.

However, I think the $SUBJECT_PATCH still makes sense:  If I were to submit
a series to report stale errors, I'd still first amend the code to clear
all stale errors (instead of leaving some of them uncleared), then amend it
to report errors prior to clearing them.  The $SUBJECT_PATCH is sort of
a fix that distributions may want to backport, whereas *reporting*
stale errors would be a new feature not eligible for backporting.
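
The two steps can be sketched in userspace roughly as follows.  This is
only a model, not the actual patch: struct w1c_reg, flush_stale() and
record_status() are hypothetical names, and the register is simulated
with write-1-to-clear semantics like the AER status registers.  Passing
a NULL callback corresponds to step one (clear everything), passing a
callback corresponds to step two (report prior to clearing).

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated write-1-to-clear status register. */
struct w1c_reg {
	uint32_t val;
};

static uint32_t reg_read(const struct w1c_reg *reg)
{
	return reg->val;
}

/* Writing 1 to a bit clears it, writing 0 leaves it untouched. */
static void reg_write(struct w1c_reg *reg, uint32_t val)
{
	reg->val &= ~val;
}

/* Hypothetical reporting hook, invoked before the bits are cleared. */
typedef void (*report_fn)(uint32_t status, void *ctx);

/* Example hook: just remember what was reported. */
static void record_status(uint32_t status, void *ctx)
{
	*(uint32_t *)ctx = status;
}

/*
 * Flush stale errors: read the status, optionally report it, then write
 * back exactly the bits that were read.  Returns the bits flushed.
 */
static uint32_t flush_stale(struct w1c_reg *reg, report_fn report, void *ctx)
{
	uint32_t status = reg_read(reg);

	if (!status)
		return 0;

	if (report)
		report(status, ctx);

	reg_write(reg, status);
	return status;
}
```

Writing back the value that was read (instead of all ones) means a bit
which becomes set between the read and the write is not cleared without
having been seen.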

> So I wonder if pci_aer_init() should just find the Capability and
> alloc its buffers, and aer_probe() should look for existing errors and
> log them before clearing them.

Devices may be enumerated after aer_probe(), e.g. when they're hot-added
below an AER-capable and hotplug-capable Root Port.  For cases like this,
we'll still have to clear (and in the future report) stale errors in
pci_aer_init().

(The $SUBJECT_PATCH takes this into account and explicitly calls out
this corner case in the commit message.)

Thanks,

Lukas
