On 1/25/2026 1:25 AM, Lukas Wunner wrote:
> Correctable and Uncorrectable Error Status Registers on reporting agents
> are cleared upon PCI device enumeration in pci_aer_init() to flush past
> events.  They're cleared again when an error is handled by the AER driver.
> 
> If an agent reports a new error after pci_aer_init() and before the AER
> driver has probed on the corresponding Root Port or Root Complex Event
> Collector, that error is not handled by the AER driver:  It clears the
> Root Error Status Register on probe, but neglects to re-clear the
> Correctable and Uncorrectable Error Status Registers on reporting agents.
> 
> The error will eventually be reported when another error occurs.  Which
> is irritating because to an end user it appears as if the earlier error
> has just happened.
> 
> Amend the AER driver to clear stale errors on reporting agents upon probe.
> 
> Skip reporting agents which have not invoked pci_aer_init() yet to avoid
> using an uninitialized pdev->aer_cap.  They're recognizable by the error
> bits in the Device Control register still being clear.
> 
> Reporting agents may execute pci_aer_init() after the AER driver has
> probed, particularly when devices are hotplugged or removed/rescanned via
> sysfs.  For this reason, it continues to be necessary that pci_aer_init()
> clears Correctable and Uncorrectable Error Status Registers.
> 

Can you include details about where and in what configuration you observed 
this issue?

Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>


> Reported-by: Lucas Van <[email protected]> # off-list
> Tested-by: Lucas Van <[email protected]>
> Signed-off-by: Lukas Wunner <[email protected]>
> ---
>  drivers/pci/pcie/aer.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e0bcaa8..4299c55 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1608,6 +1608,20 @@ static void aer_disable_irq(struct pci_dev *pdev)
>       pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
>  }
>  
> +static int clear_status_iter(struct pci_dev *dev, void *data)
> +{
> +     u16 devctl;
> +
> +     /* Skip if pci_enable_pcie_error_reporting() hasn't been called yet */
> +     pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &devctl);
> +     if (!(devctl & PCI_EXP_AER_FLAGS))
> +             return 0;
> +
> +     pci_aer_clear_status(dev);
> +     pcie_clear_device_status(dev);

Should pci_aer_init() also clear device status along with uncor/cor error 
status?

> +     return 0;
> +}
> +
>  /**
>   * aer_enable_rootport - enable Root Port's interrupts when receiving messages
>   * @rpc: pointer to a Root Port data structure
> @@ -1629,9 +1643,19 @@ static void aer_enable_rootport(struct aer_rpc *rpc)
>       pcie_capability_clear_word(pdev, PCI_EXP_RTCTL,
>                                  SYSTEM_ERROR_INTR_ON_MESG_MASK);
>  
> -     /* Clear error status */
> +     /* Clear error status of this Root Port or RCEC */
>       pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_STATUS, &reg32);
>       pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_STATUS, reg32);
> +
> +     /* Clear error status of agents reporting to this Root Port or RCEC */
> +     if (reg32 & AER_ERR_STATUS_MASK) {
> +             if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> +                     pcie_walk_rcec(pdev, clear_status_iter, NULL);
> +             else if (pdev->subordinate)
> +                     pci_walk_bus(pdev->subordinate, clear_status_iter,
> +                                  NULL);
> +     }
> +
>       pci_read_config_dword(pdev, aer + PCI_ERR_COR_STATUS, &reg32);
>       pci_write_config_dword(pdev, aer + PCI_ERR_COR_STATUS, reg32);
>       pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_STATUS, &reg32);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

