[Bug 215742] The NVME storage quirked as SIMPLE SUSPEND makes system resume failed after suspend (Regression)

bugzilla-daemon Thu, 22 Feb 2024 08:48:35 -0800

https://bugzilla.kernel.org/show_bug.cgi?id=215742


--- Comment #29 from Daniel Drake (dr...@endlessm.com) ---
The issue in that case is that all of the device memory reads back as FFFFFFFF
at that point in the test. This can be checked with `busybox devmem 0x82200000`.
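A small helper can make the all-ones check scriptable. This is only a sketch: `is_all_ones` is a hypothetical name, and it assumes the value is printed as 32-bit hex with an optional `0x` prefix, as `busybox devmem` normally does.

```shell
# Hypothetical helper: succeed if a 32-bit read came back as all ones,
# which is what a read from an unreachable PCI device typically returns.
is_all_ones() {
    # Upper-case the hex digits, then strip an optional 0x/0X prefix.
    val=$(printf '%s' "$1" | tr 'a-f' 'A-F')
    val=${val#0x}
    val=${val#0X}
    [ "$val" = "FFFFFFFF" ]
}
```

Possible usage: `is_all_ones "$(busybox devmem 0x82200000)" && echo "device memory reads as all ones"`.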

But I think my test is questionable: unbind drivers, put parent bridge in
D3hot, then D3cold, then power bridge back up, re-enter D0, and reload the
drivers. At this point the bus seems enumerable and configuration spaces are
accessible but the downstream child device is not working right. That's perhaps
not surprising; the parent bridge was power cycled, but there wasn't any
reset/reinit done at the child device level.

I tried a more complete test:

setpci -s 01:00.0 CAP_PM+4.b=03 # D3hot
setpci -s 00:06.0 CAP_PM+4.b=03 # D3hot
./acpidbg -b "execute \_SB.PC00.PEG0.PEGP._PS3"
./acpidbg -b "execute \_SB.PC00.PEG0._PS3"
./acpidbg -b "execute \_SB.PC00.PEG0.PXP._OFF"
./acpidbg -b "execute \_SB.PC00.PEG0.PXP._ON"
./acpidbg -b "execute \_SB.PC00.PEG0._PS0"
./acpidbg -b "execute \_SB.PC00.PEG0.PEGP._PS0"
setpci -s 00:06.0 CAP_PM+4.b=0 # D0
setpci -s 01:00.0 CAP_PM+4.b=0 # D0
echo "0000:00:06.0" > /sys/bus/pci/drivers/pcieport/bind
echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/bind

That fails (01:00.0 now disappears, so the setpci D0 write and the nvme driver
bind can't be done), which may be a confirmation of this problem, or perhaps
something is still not quite right in my testing.
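To tell "the device disappeared" apart from "the bind failed for another reason", the sysfs device directory can be checked between steps. A minimal sketch, assuming the standard /sys/bus/pci/devices layout; `pci_dev_present` and its optional second argument are hypothetical and exist only so the check can be exercised against a scratch directory.

```shell
# Hypothetical helper: succeed if the kernel still has the PCI device enumerated.
#   $1: full PCI address, e.g. 0000:01:00.0
#   $2: devices directory (assumption: defaults to the real sysfs path)
pci_dev_present() {
    [ -e "${2:-/sys/bus/pci/devices}/$1" ]
}
```

Possible usage: `pci_dev_present 0000:01:00.0 || echo "01:00.0 is gone"`.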

I then tried a more thorough way of testing this.

Remove this code from pci_pm_runtime_suspend():

        /*
         * If pci_dev->driver is not set (unbound), we leave the device in D0,
         * but it may go to D3cold when the bridge above it runtime suspends.
         * Save its config space in case that happens.
         */
        if (!pci_dev->driver) {
                pci_save_state(pci_dev);
                return 0;
        }

(That's needed because otherwise the driverless child device won't go
"properly" into D3cold: it'll get marked as D3cold when the parent bridge
suspends, but the ACPI bits won't be executed. In this case the NVMe device and
parent bridge both have the same ACPI power resource referenced by _PR3, so
both references must be released for the problematic codepath to be hit.)

Now unbind the nvme driver and enable runtime suspend on that device:

echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/unbind
echo auto > "/sys/bus/pci/devices/0000:01:00.0/power/control"

Now the NVMe device and parent bridge go into D3cold properly, with the PXP
power resource turned off.
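One way to confirm the state without touching config space is to read the per-device sysfs attributes. This is a sketch under assumptions: `pci_power_report` is a hypothetical helper, and the `power_state` attribute is only present on reasonably recent kernels, so the helper reports when an attribute is missing rather than failing.

```shell
# Hypothetical helper: print the kernel's view of a PCI device's power state.
#   $1: full PCI address, e.g. 0000:01:00.0
#   $2: devices directory (assumption: defaults to the real sysfs path)
pci_power_report() {
    dev_dir="${2:-/sys/bus/pci/devices}/$1"
    for attr in power_state power/runtime_status; do
        if [ -r "$dev_dir/$attr" ]; then
            printf '%s %s: %s\n' "$1" "$attr" "$(cat "$dev_dir/$attr")"
        else
            printf '%s %s: (not available)\n' "$1" "$attr"
        fi
    done
}
```

Possible usage: `pci_power_report 0000:01:00.0; pci_power_report 0000:00:06.0`. As far as I know, reading these attributes does not itself resume the device, whereas setpci or lspci config accesses may.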

Power the bridge on again:
echo on > "/sys/bus/pci/devices/0000:00:06.0/power/control"

Result:
 pcieport 0000:00:06.0: broken device, retraining non-functional downstream link at 2.5GT/s
 pcieport 0000:00:06.0: retraining failed
 pcieport 0000:00:06.0: broken device, retraining non-functional downstream link at 2.5GT/s
 pcieport 0000:00:06.0: retraining failed
 pci 0000:01:00.0: not ready 1023ms after resume; waiting
(snip)
 pci 0000:01:00.0: not ready 65535ms after resume; giving up
 pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible

The fact that the 06.0 parent bridge seems to fail early at this point might
suggest that the bridge is the thing not being resumed properly. But the PCI
bridge config space is readable, the errors are about the downstream link, and
the NVMe device config space is inaccessible. So that might instead suggest
that the NVMe device is the thing that is not being reset properly. Also, the
NVMe device has no-op _PS3 and _PS0 methods, and its _PR3 just points at the one
power resource from the root port. It feels like nothing is really managing the
reset of the NVMe device.

I'm not sure if this gets us any closer to a way of powering the devices back
up again here, or if it even really matters which of the two devices is the
culprit; disabling D3cold on either one would suffice.
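For experimenting with that last point from userspace, the `d3cold_allowed` sysfs attribute can be cleared on either device. This is a sketch only, not the quirk-based fix under discussion; `disable_d3cold` and its optional second argument are hypothetical.

```shell
# Hypothetical helper: forbid D3cold for one PCI device via sysfs.
#   $1: full PCI address, e.g. 0000:01:00.0
#   $2: devices directory (assumption: defaults to the real sysfs path)
disable_d3cold() {
    attr="${2:-/sys/bus/pci/devices}/$1/d3cold_allowed"
    if [ ! -w "$attr" ]; then
        echo "no writable d3cold_allowed for $1" >&2
        return 1
    fi
    # Writing 0 tells the PCI core not to put this device into D3cold.
    echo 0 > "$attr"
}
```

Possible usage (as root): `disable_d3cold 0000:00:06.0` or `disable_d3cold 0000:01:00.0`.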

