On 1/28/26 11:02 PM, Jonathan Cameron wrote:
On Wed, 28 Jan 2026 20:27:31 +0800
Shuai Xue <[email protected]> wrote:

On 1/27/26 6:24 PM, Jonathan Cameron wrote:
On Sat, 24 Jan 2026 15:45:54 +0800
Shuai Xue <[email protected]> wrote:
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed for the device that detected the error.
However, the DPC driver currently passes the error port that experienced
the DPC event to pcie_do_recovery().

Use the SOURCE ID register to correctly identify the device that
detected the error. When passing the error device, the
pcie_do_recovery() will find the upstream bridge and walk bridges
potentially AER affected. And subsequent commits will be able to
accurately access AER status of the error device.

Should not observe any functional changes.

Reviewed-by: Kuppuswamy Sathyanarayanan 
<[email protected]>
Signed-off-by: Shuai Xue <[email protected]>
Hi Shuai,

---
   drivers/pci/pci.h      |  2 +-
   drivers/pci/pcie/dpc.c | 25 +++++++++++++++++++++----
   drivers/pci/pcie/edr.c |  7 ++++---
   3 files changed, 26 insertions(+), 8 deletions(-)

...

-void dpc_process_error(struct pci_dev *pdev)
+/**
+ * dpc_process_error - handle the DPC error status
+ * @pdev: the port that experienced the containment event
+ *
+ * Return: the device that detected the error.
+ *
+ * NOTE: The device reference count is increased, the caller must decrement
+ * the reference count by calling pci_dev_put().
+ */
+struct pci_dev *dpc_process_error(struct pci_dev *pdev)

Maybe it makes sense to carry the err_port naming for the pci_dev
in here as well?  Seems stronger than just relying on people
reading the documentation you've added.

Good point. I think renaming the parameter would improve clarity. However,
I'd prefer to handle it in a separate patch to keep this change focused on
the functional modification. Would that work for you?

Sure. Ideal would be a precursor patch, but if it's much easier to
do on top of this I'm fine with that.

You are absolutely correct that it should be a separate patch!

Got it.

   {
        u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
        struct aer_err_info info = {};
+       struct pci_dev *err_dev;
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status); @@ -279,6 +289,7 @@ void dpc_process_error(struct pci_dev *pdev)
                        pci_aer_clear_nonfatal_status(pdev);
                        pci_aer_clear_fatal_status(pdev);
                }
+               err_dev = pci_dev_get(pdev);
                break;
        case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
        case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
@@ -290,6 +301,8 @@ void dpc_process_error(struct pci_dev *pdev)
                                "ERR_FATAL" : "ERR_NONFATAL",
                         pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
                         PCI_SLOT(source), PCI_FUNC(source));
+               err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
+                                           PCI_BUS_NUM(source), source & 0xff);

Bunch of replication in her with the pci_warn(). Maybe some local variables?
Maybe even rebuild the final parameter from PCI_DEVFN(slot, func) just to make 
the
association with the print really obvious?

Agreed. Here's the improved version:

--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -293,17 +293,28 @@ struct pci_dev *dpc_process_error(struct pci_dev *pdev)
                  break;
          case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
          case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
-               pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID,
-                                    &source);
+       {
+               int domain, bus;
+               u8 slot, func, devfn;
+               const char *err_type;
+
+               pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, 
&source);
+
+               /* Extract source device location */
+               domain = pci_domain_nr(pdev->bus);
+               bus = PCI_BUS_NUM(source);
+               slot = PCI_SLOT(source);
+               func = PCI_FUNC(source);
+               devfn = PCI_DEVFN(slot, func);
+
+               err_type = (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
+                          "ERR_FATAL" : "ERR_NONFATAL";
+
                  pci_warn(pdev, "containment event, status:%#06x, %s received from 
%04x:%02x:%02x.%d\n",
-                        status,
-                        (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
-                               "ERR_FATAL" : "ERR_NONFATAL",
-                        pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
-                        PCI_SLOT(source), PCI_FUNC(source));
-               err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
-                                           PCI_BUS_NUM(source), source & 0xff);
+                        status, err_type, domain, bus, slot, func);
+               err_dev = pci_get_domain_bus_and_slot(domain, bus, devfn);
Maybe don't bother with local variables for the things only used once.
e.g.

                err_dev = pci_get_domain_bus_and_slot(domain, bus, 
PCI_DEVFN(slot, func));

and possibly the same for err_type.

I don't mind though if you prefer to use locals for everything.

Sure, will remove local devfn and err_type.




                  break;
+       }
          case PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT:
                  ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
                  pci_warn(pdev, "containment event, status:%#06x: %s 
detected\n",


Is there any chance that this might return NULL?   Feels like maybe that's
only a possibility on a broken setup, but I'm not sure of all the wonderful
races around hotplug and DPC occurring before the OS has caught up.

Surprise Down events are handled separately in
dpc_handle_surprise_removal() and won't reach dpc_process_error().
Please correct me if I missed anything.

Probably fine. I just didn't check particularly closely.

Jonathan


Thanks for valuable comments.

Best Regards,
Shuai


Reply via email to