From: LeoLiu-oc <[email protected]>
Commit a97396c6eb13 ("PCI: pciehp: Ignore Link Down/Up caused by DPC")
amended PCIe hotplug to not bring down the slot upon Data Link Layer State
Changed events caused by Downstream Port Containment.
Commit c3be50f7547c ("PCI: pciehp: Ignore Presence Detect Changed caused by
DPC") sought to ignore Presence Detect Changed events occurring as a side
effect of Downstream Port Containment. Both commits wait for DPC recovery
to complete and then clear the events which occurred in the meantime.
However, pciehp_ist() waits up to 4 seconds before assuming that DPC
recovery has failed and disabling the slot. This timeout period is
insufficient for some PCIe devices.
For example, ice_pci_err_detected() in the ice network card driver can run
longer than the maximum time pciehp_ist() waits for DPC recovery, so
pciehp_disable_slot() is executed unnecessarily. In practice, the ice
network card then becomes unusable, and a race between
pciehp_disable_slot() and pcie_do_recovery() can cause a kernel panic.
Therefore, increase the time that pciehp_ist() waits for DPC recovery from
4 to 16 seconds. For some PCIe devices, error_detected() may still take
longer than 16 seconds, in which case dpc_reset_link() has not yet been
executed when the wait times out. In this situation, the Link Down/Up
events and Presence Detect Changed events that occur during the DPC
recovery should also be ignored.
Signed-off-by: LeoLiu-oc <[email protected]>
---
v2:
- Modify and add code comments
- Add handling for error_detected() execution exceeding 16s
v1: https://lore.kernel.org/all/[email protected]/
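For review purposes, the decision pci_dpc_recovered() makes once the wait
has finished can be modelled in userspace roughly as below. This is only a
sketch and not part of the patch: the sketch_* names and the two constants
are hypothetical stand-ins, the status value is passed in instead of being
read from config space, and the clearing of the PCI_DPC_RECOVERED flag is
not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for this sketch; not the kernel's definitions. */
#define SKETCH_DPC_STATUS_TRIGGER 0x0001u /* assumed: Trigger Status bit */
#define SKETCH_ERROR_RESPONSE     0xffffu /* all-ones read: device unreachable */

/*
 * Model of the check made after the wait has finished.
 * status:         value read from the DPC Status register
 * recovered_flag: stands in for the PCI_DPC_RECOVERED bit
 */
bool sketch_dpc_recovered(uint16_t status, bool recovered_flag)
{
	/*
	 * The read returned a plausible value and the trigger bit is still
	 * set: the error_detected() callbacks have not finished yet, so
	 * report the recovery as successful instead of disabling the slot.
	 */
	if (status != SKETCH_ERROR_RESPONSE &&
	    (status & SKETCH_DPC_STATUS_TRIGGER))
		return true;

	/* Otherwise fall back to the flag set when recovery completes. */
	return recovered_flag;
}
```

Note that an all-ones read never takes the early return, so a device that
has really gone away is still handled by the existing flag check.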
---
drivers/pci/pcie/dpc.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..331d0299af6a 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -103,6 +103,7 @@ static bool dpc_completed(struct pci_dev *pdev)
bool pci_dpc_recovered(struct pci_dev *pdev)
{
struct pci_host_bridge *host;
+ u16 status;
if (!pdev->dpc_cap)
return false;
@@ -118,10 +119,22 @@ bool pci_dpc_recovered(struct pci_dev *pdev)
/*
* Need a timeout in case DPC never completes due to failure of
* dpc_wait_rp_inactive(). The spec doesn't mandate a time limit,
- * but reports indicate that DPC completes within 4 seconds.
+ * but reports indicate that DPC completes within 16 seconds.
*/
wait_event_timeout(dpc_completed_waitqueue, dpc_completed(pdev),
- msecs_to_jiffies(4000));
+ msecs_to_jiffies(16000));
+
+ /*
+ * In some cases, report_error_detected() takes longer than 16
+ * seconds, so dpc_reset_link() has not been executed yet when the
+ * wait above times out. This situation should be treated as a
+ * successful DPC recovery.
+ */
+ pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, &status);
+ if (!PCI_POSSIBLE_ERROR(status) && (status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+ pci_warn(pdev, "DPC: error_detected() callback timed out\n");
+ return true;
+ }
return test_and_clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
}
--
2.43.0