Thank you for your contribution! Sashiko AI review found 4 potential issue(s) 
to consider:
- [High] The driver permanently masks the Surprise Link Down AER event 
(PCI_ERR_UNC_SURPDN) on the upstream switch port without ever restoring it.
- [High] Lockless read-modify-write of the shared Upstream Switch Port's AER 
mask register creates a race condition.
- [High] Unchecked pci_read_config_dword return value may corrupt the AER mask 
register.
- [High] Attempting to mask Surprise Link Down on an Upstream Port violates the 
PCIe specification and fails to suppress the error.
--

commit b4b812ffca5eaea995b148e8a053e6c4fd45bfe9
Author: Mallesh Koujalagi <[email protected]>

drm/xe: Suppress Surprise Link Down on device

This patch attempts to suppress spurious Surprise Link Down AER events during
a recovery cold reset. It does so by masking the PCI_ERR_UNC_SURPDN bit in
the upstream switch port's AER Uncorrectable Error Mask register before
triggering the cold reset.

> diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
> index 9a90a7118e896..acdedf403649f 100644
> --- a/drivers/gpu/drm/xe/xe_ras.c
> +++ b/drivers/gpu/drm/xe/xe_ras.c

[ ... ]

> +#ifdef CONFIG_PCIEAER
> +static void pcie_suppress_surprise_link_down(struct pci_dev *usp)
> +{
> +     u32 aer_uncorr_mask;
> +     u16 aer_cap;
> +
> +     aer_cap = usp->aer_cap;
> +     if (!aer_cap) {
> +             dev_dbg(&usp->dev,
> +                     "AER capability not present\n");
> +             return;
> +     }
> +
> +     pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, 
> &aer_uncorr_mask);

[Severity: High]
Does this code risk corrupting the AER mask register if the read fails?

If pci_read_config_dword() fails, for example due to temporary bus
unavailability, it can leave aer_uncorr_mask uninitialized or set it to all
1s. Writing that value back could inadvertently mask all uncorrectable errors
or write garbage to the register.

> +     aer_uncorr_mask |= PCI_ERR_UNC_SURPDN;
> +     pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, 
> aer_uncorr_mask);

[Severity: High]
Is it safe to perform a lockless read-modify-write on a shared parent
bridge's configuration space?

Since the upstream switch port is outside the xe driver's ownership, if the
AER core or another driver concurrently modifies this mask register without
holding pci_dev_lock() on the upstream port, could one of the updates be
silently lost?

[Severity: High]
Will this permanently mask the Surprise Link Down AER event on the upstream
switch port?

The configuration space of the port above the cold reset boundary isn't
reset by the slot power cycle, and there does not appear to be any
corresponding code to unmask this bit later. Could this permanently blind
the port to legitimate Surprise Link Down errors until the next system
reboot?

> +     dev_dbg(&usp->dev, "Surprise Link Down masked for cold reset\n");
> +}
> +#endif /* CONFIG_PCIEAER */
> +
> +static void punit_error_handler(struct xe_device *xe)
> +{
> +#ifdef CONFIG_PCIEAER
> +     struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +     struct pci_dev *vsp, *usp;
> +
> +     /*
> +      * Device Hierarchy:
> +      *
> +      * Root Port --> Upstream Switch Port (USP) --> Virtual Switch Port 
> (VSP) --> SGunit
> +      *
> +      * Cold reset power-cycles the slot, dropping the PCIe link. The
> +      * slot triggers a spurious Surprise Link Down AER event on the USP.
> +      */
> +     vsp = pci_upstream_bridge(pdev);
> +     usp = vsp ? pci_upstream_bridge(vsp) : NULL;
> +
> +     if (usp)
> +             pcie_suppress_surprise_link_down(usp);

[Severity: High]
Will masking the Surprise Link Down bit on the Upstream Switch Port actually
suppress the intended error?

According to the PCIe Base Specification, the Surprise Down Error Status bit
is mandatory for Downstream Ports and reserved for all other port types.

Since the Downstream Port (vsp) facing the slot is what actually detects the
link drop and triggers the AER event, does writing to a reserved bit in the
upstream port's config space fail to suppress the spurious AER event?

> +#endif
>       xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);
>       xe_device_declare_wedged(xe);
>  }

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=5

Reply via email to