On 7/31/25 6:01 AM, Lukas Wunner wrote:
On Wed, Jul 30, 2025 at 10:24:07PM +0200, Lukas Wunner wrote:
On Wed, Jul 30, 2025 at 10:01:50PM +0200, Lukas Wunner wrote:
On Wed, Jul 30, 2025 at 01:20:57PM +0200, Niklas Schnelle wrote:
Since commit 7b42d97e99d3 ("PCI/ERR: Always report current recovery
status for udev") AER uses the result of error_detected() as parameter
to pci_uevent_ers(). As pci_uevent_ers() however does not handle
PCI_ERS_RESULT_NEED_RESET this results in a missing uevent for the
beginning of recovery if drivers request a reset. Fix this by treating
PCI_ERS_RESULT_NEED_RESET as beginning recovery.
[...]
+++ b/drivers/pci/pci-driver.c
@@ -1592,6 +1592,7 @@ void pci_uevent_ers(struct pci_dev *pdev, enum
pci_ers_result err_type)
switch (err_type) {
case PCI_ERS_RESULT_NONE:
case PCI_ERS_RESULT_CAN_RECOVER:
+ case PCI_ERS_RESULT_NEED_RESET:
envp[idx++] = "ERROR_EVENT=BEGIN_RECOVERY";
envp[idx++] = "DEVICE_ONLINE=0";
break;
I note that PCI_ERS_RESULT_NO_AER_DRIVER is also missing in that
switch/case statement. I guess for the patch to be complete,
it needs to be added to the PCI_ERS_RESULT_DISCONNECT case.
Do you agree?
I realize now there's a bigger problem here: In pcie_do_recovery(),
when control reaches the "failed:" label, a uevent is only signaled
for the *bridge*. Shouldn't a uevent instead be signaled for every
device *below* the bridge? (And possibly the bridge itself if it was
the device reporting the error.)
The small patch below should resolve this issue.
Please let me know what you think.
In that case you don't need to add PCI_ERS_RESULT_NO_AER_DRIVER to
the switch/case statement because we wouldn't want to have multiple
uevents reporting disconnect, so the one emitted below the "failed:"
label would be sufficient.
I'll send a separate Reviewed-by for your original patch as the small
patch below should resolve my concern about PCI_ERS_RESULT_NO_AER_DRIVER.
This all looks so broken that I'm starting to wonder if there's any
user space application at all that takes advantage of these uevents?
I'd still be interested to know which user space application you're
using to track these uevents?
Thanks,
Lukas
-- >8 --
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e795e5ae..3a95aa2 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,6 +165,12 @@ static int report_resume(struct pci_dev *dev, void *data)
return 0;
}
+static int report_disconnect(struct pci_dev *dev, void *data)
+{
+ pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+ return 0;
+}
Since you are notifying the user space, I am wondering whether the drivers
should be notified about the recovery failure?
+
/**
* pci_walk_bridge - walk bridges potentially AER affected
* @bridge: bridge which may be a Port, an RCEC, or an RCiEP
@@ -272,7 +278,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
failed:
pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
- pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);
+ pci_walk_bridge(bridge, report_disconnect, NULL);
pci_info(bridge, "device recovery failed\n");
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer