DPC: Run recovery on device that detected the error

Shuai Xue Wed, 28 Jan 2026 04:28:33 -0800



On 1/27/26 6:24 PM, Jonathan Cameron wrote:

On Sat, 24 Jan 2026 15:45:54 +0800
Shuai Xue <[email protected]> wrote:

The current implementation of pcie_do_recovery() assumes that the
recovery process is executed for the device that detected the error.
However, the DPC driver currently passes the error port that experienced
the DPC event to pcie_do_recovery().

Use the SOURCE ID register to correctly identify the device that
detected the error. When passing the error device, the
pcie_do_recovery() will find the upstream bridge and walk bridges
potentially AER affected. And subsequent commits will be able to
accurately access AER status of the error device.

Should not observe any functional changes.

Reviewed-by: Kuppuswamy Sathyanarayanan 
<[email protected]>
Signed-off-by: Shuai Xue <[email protected]>
---
  drivers/pci/pci.h      |  2 +-
  drivers/pci/pcie/dpc.c | 25 +++++++++++++++++++++----
  drivers/pci/pcie/edr.c |  7 ++++---
  3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa001..58640e656897 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -771,7 +771,7 @@ struct rcec_ea {
  void pci_save_dpc_state(struct pci_dev *dev);
  void pci_restore_dpc_state(struct pci_dev *dev);
  void pci_dpc_init(struct pci_dev *pdev);
-void dpc_process_error(struct pci_dev *pdev);
+struct pci_dev *dpc_process_error(struct pci_dev *pdev);
  pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
  bool pci_dpc_recovered(struct pci_dev *pdev);
  unsigned int dpc_tlp_log_len(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index bff29726c6a5..f6069f621683 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -260,10 +260,20 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev 
*dev,
        return 1;
  }

-void dpc_process_error(struct pci_dev *pdev)

+/**
+ * dpc_process_error - handle the DPC error status
+ * @pdev: the port that experienced the containment event
+ *
+ * Return: the device that detected the error.
+ *
+ * NOTE: The device reference count is increased, the caller must decrement
+ * the reference count by calling pci_dev_put().
+ */
+struct pci_dev *dpc_process_error(struct pci_dev *pdev)


Maybe it makes sense to carry the err_port naming for the pci_dev
in here as well?  Seems stronger than just relying on people
reading the documentation you've added.


Good point. I think renaming the parameter would improve clarity. However,
I'd prefer to handle it in a separate patch to keep this change focused on
the functional modification. Would that work for you?

  {
        u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
        struct aer_err_info info = {};
+       struct pci_dev *err_dev;

pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);@@ -279,6 +289,7 @@ void dpc_process_error(struct pci_dev *pdev)

                        pci_aer_clear_nonfatal_status(pdev);
                        pci_aer_clear_fatal_status(pdev);
                }
+               err_dev = pci_dev_get(pdev);
                break;
        case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
        case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
@@ -290,6 +301,8 @@ void dpc_process_error(struct pci_dev *pdev)
                                "ERR_FATAL" : "ERR_NONFATAL",
                         pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
                         PCI_SLOT(source), PCI_FUNC(source));
+               err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
+                                           PCI_BUS_NUM(source), source & 0xff);


Bunch of replication in her with the pci_warn(). Maybe some local variables?
Maybe even rebuild the final parameter from PCI_DEVFN(slot, func) just to make 
the
association with the print really obvious?


Agreed. Here's the improved version:

--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -293,17 +293,28 @@ struct pci_dev *dpc_process_error(struct pci_dev *pdev)
                break;
        case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
        case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
-               pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID,
-                                    &source);
+       {
+               int domain, bus;
+               u8 slot, func, devfn;
+               const char *err_type;
+
+               pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, 
&source);
+
+               /* Extract source device location */
+               domain = pci_domain_nr(pdev->bus);
+               bus = PCI_BUS_NUM(source);
+               slot = PCI_SLOT(source);
+               func = PCI_FUNC(source);
+               devfn = PCI_DEVFN(slot, func);
+
+               err_type = (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
+                          "ERR_FATAL" : "ERR_NONFATAL";
+
                pci_warn(pdev, "containment event, status:%#06x, %s received from 
%04x:%02x:%02x.%d\n",
-                        status,
-                        (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
-                               "ERR_FATAL" : "ERR_NONFATAL",
-                        pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
-                        PCI_SLOT(source), PCI_FUNC(source));
-               err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
-                                           PCI_BUS_NUM(source), source & 0xff);
+                        status, err_type, domain, bus, slot, func);
+               err_dev = pci_get_domain_bus_and_slot(domain, bus, devfn);
                break;
+       }
        case PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT:
                ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
                pci_warn(pdev, "containment event, status:%#06x: %s detected\n",


Is there any chance that this might return NULL?   Feels like maybe that's
only a possibility on a broken setup, but I'm not sure of all the wonderful
races around hotplug and DPC occurring before the OS has caught up.


Surprise Down events are handled separately in
dpc_handle_surprise_removal() and won't reach dpc_process_error().
Please correct me if I missed anything.

Thanks for valuable comments.

Best Regards,
Shuai

Re: [PATCH v7 2/5] PCI/DPC: Run recovery on device that detected the error

Reply via email to