On Thu, Aug 14, 2025 at 12:29:25PM -0700, Sathyanarayanan Kuppuswamy wrote: > On 8/14/25 2:36 AM, Lukas Wunner wrote: > > On Thu, Aug 14, 2025 at 09:56:09AM +0200, Niklas Schnelle wrote: > > > On Wed, 2025-08-13 at 07:11 +0200, Lukas Wunner wrote: > > > > @@ -233,6 +228,14 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev > > > > *dev, > > > > pci_walk_bridge(bridge, report_mmio_enabled, &status); > > > > } > > > > + if (status == PCI_ERS_RESULT_NEED_RESET || > > > > + state == pci_channel_io_frozen) { > > > > + if (reset_subordinates(bridge) != > > > > PCI_ERS_RESULT_RECOVERED) { > > > > + pci_warn(bridge, "subordinate device reset > > > > failed\n"); > > > > + goto failed; > > > > + } > > > > + } > > > > + > > > > if (status == PCI_ERS_RESULT_NEED_RESET) { > > > > /* > > > > * TODO: Should call platform-specific > > > > > > I wonder if it might make sense to merge the reset into the above > > > existing if. > > > > There are drivers such as drivers/bus/mhi/host/pci_generic.c which > > return PCI_ERS_RESULT_RECOVERED from ->error_detected(). So they > > fall through directly to the ->resume() stage. They're doing this > > even in the pci_channel_io_frozen case (i.e. for Fatal Errors). > > > > But for DPC we must call reset_subordinates() to bring the link back up. > > And for Fatal Errors, Documentation/PCI/pcieaer-howto.rst suggests that > > we must likewise call it because the link may be unreliable. > > For fatal errors, since we already ignore the value returned by > ->error_detected() (by calling reset_subordinates() unconditionally), why > not update status accordingly in report_frozen_detected() and notify the > driver about the reset? > > That way, the reset logic could be unified under a single if > (status == PCI_ERS_RESULT_NEED_RESET) condition. > > Checking the drivers/bus/mhi/host/pci_generic.c implementation, it looks > like calling slot_reset callback looks harmless.
Unfortunately it's not harmless: mhi_pci_slot_reset() calls pci_enable_device(). But a corresponding call to pci_disable_device() is only performed before in mhi_pci_error_detected() if that function returns PCI_ERS_RESULT_NEED_RESET. So there would be an enable_cnt imbalance if I'd change the logic to overwrite the driver's vote with PCI_ERS_RESULT_NEED_RESET in the pci_channel_io_frozen case and call its ->slot_reset() callback. The approach taken by this patch is to minimize risk, avoid any changes to drivers, make do with minimal changes to pcie_do_recovery() and limit the behavioral change. I think overriding status = PCI_ERS_RESULT_NEED_RESET and calling drivers' ->slot_reset() would have to be done in a separate patch on top and would require going through all drivers again to see which ones need to be amended. Also, note that report_frozen_detected() is too early to set "status = PCI_ERS_RESULT_NEED_RESET". That needs to happen after the ->mmio_enabled() step, so that drivers get a chance to examine the device even in the pci_channel_io_frozen case before a reset is performed. (The ->mmio_enabled() step is only performed if "status" is PCI_ERS_RESULT_CAN_RECOVER.) So then the code would look like this: if (state == pci_channel_io_frozen) status = PCI_ERS_RESULT_NEED_RESET; if (status == PCI_ERS_RESULT_NEED_RESET) { if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) { pci_warn(bridge, "subordinate device reset failed\n"); goto failed; } status = PCI_ERS_RESULT_RECOVERED; pci_dbg(bridge, "broadcast slot_reset message\n"); pci_walk_bridge(bridge, report_slot_reset, &status); } ... which isn't very different from the present patch: if (status == PCI_ERS_RESULT_NEED_RESET || state == pci_channel_io_frozen) { if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) { pci_warn(bridge, "subordinate device reset failed\n"); goto failed; } } if (status == PCI_ERS_RESULT_NEED_RESET) { status = PCI_ERS_RESULT_RECOVERED; pci_dbg(bridge, "broadcast slot_reset message\n"); pci_walk_bridge(bridge, report_slot_reset, &status); } ... except that this patch avoids touching any drivers. Thanks, Lukas