https://bugs.kde.org/show_bug.cgi?id=510992

nyanpasu64 <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #26 from nyanpasu64 <[email protected]> ---
For some added information, I was encountering likely the same issue on my
computers. Since I have past experience debugging Linux suspend issues, I
decided to debug the code to find what went wrong (in hindsight a rabbit hole
deeper than I could have imagined).

After I moved my Arch Linux SSD from a Ryzen desktop into an Intel NUC, Plasma
would fail to sleep from the lock screen (if I left the computer idle or
clicked Sleep myself). It would successfully sleep from the start menu (which
matches the "good sleep with Kickoff menu" reported here). I did not try
sleeping from Ctrl+Alt+Del.

## Tracing the kernel

On a failed sleep, I got the "Some devices failed to suspend, or early wake
event detected" error message. Based on past knowledge about debugging system
suspend, I ran the following commands in a root console:
echo 1 > /sys/power/pm_print_times
echo 1 > /sys/power/pm_debug_messages

Afterwards, a failed sleep produced `journalctl --system` messages:
Oct 24 22:09:30 ryzen kernel: PM: Wakeup pending, aborting suspend
Oct 24 22:09:30 ryzen kernel: PM: active wakeup source: mmc0
Oct 24 22:09:30 ryzen kernel: PM: suspend of devices aborted after 151.615
msecs
Oct 24 22:09:30 ryzen kernel: PM: start suspend of devices aborted after
169.797 msecs
Oct 24 22:09:30 ryzen kernel: PM: Some devices failed to suspend, or early wake
event detected

The "Wakeup pending, aborting suspend" message comes from function
`pm_wakeup_pending()`. This function checks if event checks are enabled, and if
some counters have changed aborts suspend and calls
`pm_print_active_wakeup_sources()`, which prints `wakeup_sources`. I don't know
what changed the counters, but instead looked into what wrote to
wakeup_sources. I found that `pm_wakeup_ws_event()` would activate an event and
`wakeup_source_register() → wakeup_source_add()` would add a new one.

- While writing this comment, I decided to trace `pm_wakeup_pending`. It was
called so many times it overflowed my terminal's scrollback! I tried running it
again, but plasmashell hung, kscreenlocker_greet froze freeze at a black
screen, then *kwin* itself hung, I had to enable raw input with SysRq-R to open
a TTY, and rebooting hung with `rcu_tasks_wait_gp` counting up endlessly. Yay
Linux...

To find who changed wakeup events, I used my stacksnoop fork at
https://github.com/nyanpasu64/bcc/blob/local/examples/tracing/stacksnoop.py to
trace a failed suspend:
nyanpasu64@ryzen ~/code/bcc (local)> sudo ./examples/tracing/stacksnoop.py
pm_wakeup_ws_event wakeup_source_register
TIME(s)            FUNCTION
5.660198689:
0: ret_from_fork_asm [kernel]
 1: ret_from_fork [kernel]
  2: kthread [kernel]
   3: worker_thread [kernel]
    4: process_one_work [kernel]
     5: async_run_entry_fn [kernel]
      6: async_suspend [kernel]
       7: device_suspend [kernel]
        8: dpm_run_callback [kernel]
         9: pci_pm_suspend [kernel]
          10: __pm_runtime_resume [kernel]
           11: rpm_resume [kernel]
            12: rpm_callback [kernel]
             13: __rpm_callback [kernel]
              14: rtsx_pci_runtime_resume [rtsx_pci]
               15: mmc_detect_change [mmc_core]
                16: pm_wakeup_ws_event [kernel]

Reading the source, the problem is that `pci_pm_suspend()` wakes PCI(e) devices
before sending them into a full sleep state, but in the process,
`_mmc_detect_change()` will "Prevent system sleep for 5s to allow user space to
consume the\n corresponding uevent"... which interrupts a system sleep in
progress!

## Debugging userspace `wakeup_count`

One oddity I found was that even when I didn't get a failed sleep, stacksnoop
said that `mmc_detect_change() → pm_wakeup_ws_event()` was still being called!
This bug thread provides a possible solution to this last mystery; the kernel
doesn't abort sleep from a wakeup event unless event checks are enabled, and
perhaps it's only true when Plasma is "writing wakeup count file".

The docs for `/sys/power/wakeup_count`
(https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-power#:~:text=What:%09%09/sys/power/wakeup_count)
say the file explicitly enables "taking into account the concurrent arrival of
wakeup events". And I've found broken code in the kernel sending wakeup events
during suspend. So when Plasma 6.5.0 enabled wakeup checks, it must've caused
these events to abort suspend!

To test with this file disabled, I ran `sudo chmod -x
/usr/libexec/kf6/kauth/wakeupsourcehelper` on a fresh Fedora boot. I tried
sleeping from Win+L four times, the first and third with +x and the second and
fourth with -x. The journal log "Some devices failed to suspend" only appeared
with +x, suggesting this workaround is necessary and sufficient to fix failed
suspends.

I think sleeping from the start menu doesn't fail, because `wakeupsourcehelper`
only runs when sleeping from the lock screen:

- I started `journalctl --follow` and slept from the lock screen (which fades
to black and turns off the screen). The journal contained a message:
`systemd[1]: Started
dbus-:[email protected].` followed by "Some
devices failed to suspend". Neither message appeared when sleeping from the
start menu.
    - Is it an oversight that `wakeupsourcehelper` is never run by the start
menu?
https://invent.kde.org/plasma/powerdevil/-/blob/master/daemon/actions/bundled/suspendsession.cpp
unconditionally calls "org.kde.powerdevil.wakeupsourcehelper" for sleep and
hibernate, but we don't see it running.

## Workarounds

- On Arch Linux, blacklisting `rtsx_pci_sdmmc` fixed system sleep, at the cost
of breaking the SD card reader.
- On Fedora, I added file `/etc/systemd/sleep.conf.d/mysleep.conf` with
contents `[Sleep]\nSuspendState=mem mem`. This retries a failed suspend, and
the second attempt will work.

## Next steps

To identify the kernel drivers at fault, I strongly recommend that people on
this bug thread run:

echo 1 > /sys/power/pm_print_times
echo 1 > /sys/power/pm_debug_messages

then attempt a failed suspend (hopefully it doesn't crash), and see which
devices show up in `journalctl --system -b0` or `dmesg` under "active wakeup
source".
Then to identify the exact stack traces, you can also run `stacksnoop.py
pm_wakeup_ws_event wakeup_source_register` with my bcc repo (or use upstream
https://github.com/iovisor/bcc if you don't trust my code).

However once the patch is merged and disables `/sys/power/wakeup_count` in
Plasma 6.5.3, I don't know how others would test what hardware is at fault. I
think I'd need a powerdevil maintainer to suggest a course of action.

-- 
You are receiving this mail because:
You are watching all bug changes.

Reply via email to