PUNIT errors can only be recovered by a power cycle, so the Xe KMD
sends a uevent to notify userspace to trigger one. On some platforms,
the link drop caused by powering the device off and back on is
reported by hardware as a Surprise Link Down (SLD), which AER then
escalates as an Uncorrectable Fatal Error. That error fires before
the device finishes coming back up and defeats the very recovery we
are attempting.

To keep the expected, recovery-induced link drop from being raised as
a fatal AER event, mask the Surprise Link Down bit
(PCI_ERR_UNC_SURPDN) in the upstream port's AER Uncorrectable Error
Mask register before punit_error_handler() requests the cold reset.

Signed-off-by: Mallesh Koujalagi <[email protected]>
---
v6:
- Expand commit message to explain why SUR_DN is masked. (Raag/Riana)
- Check Slot Implemented bit before reading Slot Capabilities, per
  PCIe spec. (Riana)
- Add debug log.

v7:
- Handle surprise link down event properly. (Aravind/Riana)
- Update commit message. (Riana)
- Correct log message.
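
Note for reviewers: the read-modify-write on the AER Uncorrectable
Error Mask register described in the commit message can be tried
outside the kernel with the userspace sketch below. It is only an
illustration, not part of the patch. The fake_port structure and the
fake_* helpers are hypothetical stand-ins for struct pci_dev and the
pci_{read,write}_config_dword() accessors; the two register constants
are assumed to match include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed to match include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNCOR_MASK	0x08		/* Uncorrectable Error Mask offset */
#define PCI_ERR_UNC_SURPDN	0x00000020	/* Surprise Down */

/* Hypothetical stand-in for one port's 4 KiB extended config space. */
struct fake_port {
	uint8_t cfg[4096];
	uint16_t aer_cap;	/* offset of the AER capability, 0 if absent */
};

/* Hypothetical accessor: read a dword from the fake config space. */
static uint32_t fake_cfg_read32(const struct fake_port *p, uint16_t off)
{
	uint32_t v;

	assert(off + sizeof(v) <= sizeof(p->cfg));
	memcpy(&v, &p->cfg[off], sizeof(v));
	return v;
}

/* Hypothetical accessor: write a dword to the fake config space. */
static void fake_cfg_write32(struct fake_port *p, uint16_t off, uint32_t v)
{
	assert(off + sizeof(v) <= sizeof(p->cfg));
	memcpy(&p->cfg[off], &v, sizeof(v));
}

/*
 * Set the Surprise Down bit in the Uncorrectable Error Mask register,
 * leaving every other mask bit untouched.
 * Returns 0 on success, -1 if the port has no AER capability.
 */
static int fake_suppress_surprise_link_down(struct fake_port *p)
{
	uint32_t mask;

	if (!p->aer_cap)
		return -1;

	mask = fake_cfg_read32(p, p->aer_cap + PCI_ERR_UNCOR_MASK);
	mask |= PCI_ERR_UNC_SURPDN;
	fake_cfg_write32(p, p->aer_cap + PCI_ERR_UNCOR_MASK, mask);
	return 0;
}
```

The sketch only ORs the bit in, so mask bits already set by firmware
or the AER core are preserved. Like the patch, it does not restore
the previous mask value afterwards.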
---
 drivers/gpu/drm/xe/xe_ras.c | 38 +++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
index 9a90a7118e89..acdedf403649 100644
--- a/drivers/gpu/drm/xe/xe_ras.c
+++ b/drivers/gpu/drm/xe/xe_ras.c
@@ -209,8 +209,46 @@ static enum xe_ras_recovery_action handle_core_compute_errors(struct xe_ras_erro
        return XE_RAS_RECOVERY_ACTION_RECOVERED;
 }
 
+#ifdef CONFIG_PCIEAER
+static void pcie_suppress_surprise_link_down(struct pci_dev *usp)
+{
+       u32 aer_uncorr_mask;
+       u16 aer_cap;
+
+       aer_cap = usp->aer_cap;
+       if (!aer_cap) {
+               dev_dbg(&usp->dev,
+                       "AER capability not present\n");
+               return;
+       }
+
+       pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, &aer_uncorr_mask);
+       aer_uncorr_mask |= PCI_ERR_UNC_SURPDN;
+       pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, aer_uncorr_mask);
+       dev_dbg(&usp->dev, "Surprise Link Down masked for cold reset\n");
+}
+#endif /* CONFIG_PCIEAER */
+
 static void punit_error_handler(struct xe_device *xe)
 {
+#ifdef CONFIG_PCIEAER
+       struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
+       struct pci_dev *vsp, *usp;
+
+       /*
+        * Device Hierarchy:
+        *
+        * Root Port --> Upstream Switch Port (USP) --> Virtual Switch Port (VSP) --> SGunit
+        *
+        * Cold reset power-cycles the slot, dropping the PCIe link. The
+        * slot triggers a spurious Surprise Link Down AER event on the USP.
+        */
+       vsp = pci_upstream_bridge(pdev);
+       usp = vsp ? pci_upstream_bridge(vsp) : NULL;
+
+       if (usp)
+               pcie_suppress_surprise_link_down(usp);
+#endif
        xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET);
        xe_device_declare_wedged(xe);
 }
-- 
2.34.1
