AER: Allow drivers to opt in to Bus Reset on Non-Fatal Errors

Lukas Wunner Sun, 17 Aug 2025 06:18:13 -0700

On Thu, Aug 14, 2025 at 12:29:25PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 8/14/25 2:36 AM, Lukas Wunner wrote:
> > On Thu, Aug 14, 2025 at 09:56:09AM +0200, Niklas Schnelle wrote:
> > > On Wed, 2025-08-13 at 07:11 +0200, Lukas Wunner wrote:
> > > > @@ -233,6 +228,14 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev 
> > > > *dev,
> > > >                 pci_walk_bridge(bridge, report_mmio_enabled, &status);
> > > >         }
> > > > +       if (status == PCI_ERS_RESULT_NEED_RESET ||
> > > > +           state == pci_channel_io_frozen) {
> > > > +               if (reset_subordinates(bridge) != 
> > > > PCI_ERS_RESULT_RECOVERED) {
> > > > +                       pci_warn(bridge, "subordinate device reset 
> > > > failed\n");
> > > > +                       goto failed;
> > > > +               }
> > > > +       }
> > > > +
> > > >         if (status == PCI_ERS_RESULT_NEED_RESET) {
> > > >                 /*
> > > >                  * TODO: Should call platform-specific
> > > 
> > > I wonder if it might make sense to merge the reset into the above
> > > existing if.
> > 
> > There are drivers such as drivers/bus/mhi/host/pci_generic.c which
> > return PCI_ERS_RESULT_RECOVERED from ->error_detected().  So they
> > fall through directly to the ->resume() stage.  They're doing this
> > even in the pci_channel_io_frozen case (i.e. for Fatal Errors).
> > 
> > But for DPC we must call reset_subordinates() to bring the link back up.
> > And for Fatal Errors, Documentation/PCI/pcieaer-howto.rst suggests that
> > we must likewise call it because the link may be unreliable.
> 
> For fatal errors, since we already ignore the value returned by
> ->error_detected() (by calling reset_subordinates() unconditionally), why
> not update status accordingly in report_frozen_detected() and notify the
> driver about the reset?
> 
> That way, the reset logic could be unified under a single if
> (status == PCI_ERS_RESULT_NEED_RESET) condition.
> 
> Checking the drivers/bus/mhi/host/pci_generic.c implementation, it looks
> like calling slot_reset callback looks harmless.


Unfortunately it's not harmless:

mhi_pci_slot_reset() calls pci_enable_device().  But a corresponding
call to pci_disable_device() is only performed before in
mhi_pci_error_detected() if that function returns
PCI_ERS_RESULT_NEED_RESET.

So there would be an enable_cnt imbalance if I'd change the logic to
overwrite the driver's vote with PCI_ERS_RESULT_NEED_RESET in the
pci_channel_io_frozen case and call its ->slot_reset() callback.

The approach taken by this patch is to minimize risk, avoid any changes
to drivers, make do with minimal changes to pcie_do_recovery() and
limit the behavioral change.

I think overriding status = PCI_ERS_RESULT_NEED_RESET and calling drivers'
->slot_reset() would have to be done in a separate patch on top and would
require going through all drivers again to see which ones need to be
amended.

Also, note that report_frozen_detected() is too early to set
"status = PCI_ERS_RESULT_NEED_RESET".  That needs to happen after the
->mmio_enabled() step, so that drivers get a chance to examine the
device even in the pci_channel_io_frozen case before a reset is
performed.  (The ->mmio_enabled() step is only performed if "status" is
PCI_ERS_RESULT_CAN_RECOVER.)

So then the code would look like this:

        if (state == pci_channel_io_frozen)
                status = PCI_ERS_RESULT_NEED_RESET;

        if (status == PCI_ERS_RESULT_NEED_RESET) {
                if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
                        pci_warn(bridge, "subordinate device reset failed\n");
                        goto failed;
                }

                status = PCI_ERS_RESULT_RECOVERED;
                pci_dbg(bridge, "broadcast slot_reset message\n");
                pci_walk_bridge(bridge, report_slot_reset, &status);
        }

... which isn't very different from the present patch:

        if (status == PCI_ERS_RESULT_NEED_RESET ||
            state == pci_channel_io_frozen) {
                if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
                        pci_warn(bridge, "subordinate device reset failed\n");
                        goto failed;
                }
        }

        if (status == PCI_ERS_RESULT_NEED_RESET) {
                status = PCI_ERS_RESULT_RECOVERED;
                pci_dbg(bridge, "broadcast slot_reset message\n");
                pci_walk_bridge(bridge, report_slot_reset, &status);
        }

... except that this patch avoids touching any drivers.

Thanks,

Lukas

Re: [PATCH 1/5] PCI/AER: Allow drivers to opt in to Bus Reset on Non-Fatal Errors

Reply via email to