https://bugs.kde.org/show_bug.cgi?id=520562

Matei Marcu <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #193126|0                           |1
        is obsolete|                            |

--- Comment #5 from Matei Marcu <[email protected]> ---
Created attachment 193127
  --> https://bugs.kde.org/attachment.cgi?id=193127&action=edit
Corrected log + comment bundle. Supersedes previous attachment.

Thanks for the review and the pointer to the crash-report guide.

**Errata to the original report:** DP-1 (Alienware AW2725DF) and DP-3 (ASUS
PG27AQWP-W) currently run in landscape orientation. The original report's
"portrait" framing reflected a temporary rotation experiment carried out during
local diagnostics; physical rotation turned out not to be load-bearing for this
failure (the EINVAL cascade fires identically with the outputs in either
orientation, and DRM mode strings are always reported in native-landscape
resolution regardless of plane rotation).

A note on the artifact shape: this is a configuration-loop bug rather than a
SIGSEGV, so `DrKonqi` does not fire, `coredumpctl` has no entry, and a
postmortem backtrace from a coredump is not available. Instead, this capture
pairs `strace -e ioctl` on `kwin_wayland` (which records every
`DRM_IOCTL_MODE_ATOMIC` call during the failure cascade) with the per-pipeline
diagnostic output from a locally patched KWin build. It also includes a
symbolic `gdb thread apply all bt full` taken shortly after KWin gives up
retrying, which shows the post-failure idle state of all seven KWin threads
with full source line resolution via debuginfod. The narrative is in the inline
summary below; the bulk artifacts are attached.

## Updated failure description (sharper than the original report)

The original report described the symptom as a "wedge". With the new strace
capture the failure model is more precise: KWin's `KWin::DrmGpu::testPipelines`
(`src/backends/drm/drm_gpu.cpp:484`) calls
`commitPipelines(CommitMode::TestAllowModeset)` with all three outputs bundled
into a single atomic test commit. The kernel returns `EINVAL`.
`KWin::DrmGpu::checkCrtcAssignment` (`drm_gpu.cpp:355`) then retries with a
different CRTC-to-connector permutation. Every permutation returns `EINVAL`.
KWin exhausts the search after 168 atomic test commits in ~4.2 seconds and
disables DP-2 (and on some runs cascades to disabling all outputs).

The kernel is not deadlocked. The `nvidia-modeset` driver returns `EINVAL`
synchronously from each `DRM_IOCTL_MODE_ATOMIC` call, and no thread is stuck in
an ioctl: in the present capture KWin's threads sleep normally in
`pthread_cond_wait` after the recursive search gives up. The earlier report's
"hung-task watchdog" framing was therefore likely incorrect. A separate
`v4l2_open` BUG_ON in the out-of-tree `v4l2loopback` module was observed in
`dmesg` during this investigation and may have been the actual source of the
watchdog dumps in the original report, but that attribution is tentative.

## Inline summary of the strace capture (the actual gold)

From `strace -f -tt -v -e trace=ioctl` on `kwin_wayland` during a single
trigger event:

```
total DRM_IOCTL_MODE_ATOMIC calls : 475
returns:
  0     307     (succeeded)
  -1    168     (EINVAL)

flags distribution:
  ATOMIC_TEST_ONLY|ATOMIC_ALLOW_MODESET    168    <-- ALL FAIL with EINVAL
  ATOMIC_TEST_ONLY|ATOMIC_NONBLOCK         156    <-- all succeed
  PAGE_FLIP_EVENT|ATOMIC_NONBLOCK          151    <-- all succeed

count_objs distribution:
   2 objects     145 calls    (page flips, succeed)
   3 objects     162 calls    (single-output tests, succeed)
  21 objects     168 calls    (three-output bundled tests, all EINVAL)
```
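A tally of this shape can be reproduced from a capture with a few lines of Python. This is a minimal sketch, not the script behind `atomic-decoded.txt`: it assumes each strace line carries decoded `flags=` and `count_objs=` fields and ends in `= <ret>`, and that line shape, the regexes, and the helper name `tally_atomic` are illustrative assumptions that may need adjusting to the real contents of `strace-ioctl.txt`.

```python
import re
from collections import Counter

# Assumed line shape (illustrative, not copied from the attached capture):
#   100 12:00:00.000001 ioctl(15, DRM_IOCTL_MODE_ATOMIC, {flags=A|B, count_objs=21}) = -1 EINVAL (Invalid argument)
FLAGS_RE = re.compile(r"flags=([^,}\s]+)")
OBJS_RE = re.compile(r"count_objs=(\d+)")
RET_RE = re.compile(r"=\s*(-?\d+)(?:\s+([A-Z]+)\s+\(.*\))?\s*$")


def tally_atomic(lines):
    """Count DRM_IOCTL_MODE_ATOMIC calls by result, flags and count_objs."""
    by_ret, by_flags, by_objs = Counter(), Counter(), Counter()
    for line in lines:
        if "DRM_IOCTL_MODE_ATOMIC" not in line:
            continue
        ret = RET_RE.search(line)
        if ret is None:
            # No return value on this line (for example an unfinished call).
            continue
        # Key on the errno name when present, otherwise on the numeric result.
        by_ret[ret.group(2) or ret.group(1)] += 1
        flags = FLAGS_RE.search(line)
        if flags:
            by_flags[flags.group(1)] += 1
        objs = OBJS_RE.search(line)
        if objs:
            by_objs[int(objs.group(1))] += 1
    return by_ret, by_flags, by_objs
```

Calling `tally_atomic(open("strace-ioctl.txt"))` would return three `Counter` objects corresponding to the three blocks of the summary, provided the assumed line shape holds.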

Interpretation:

- Single-output atomic tests (`count_objs=3`, one CRTC + one connector + one
plane) pass throughout.
- Page-flip submissions (`count_objs=2`) on individual outputs pass throughout.
- Only the combined three-output test commit (`count_objs=21`) under
`ATOMIC_ALLOW_MODESET` fails, and it fails every time, 168 in a row.
- The 168 figure is the call count of `commit->testAllowModeset()` (which
performs `DRM_IOCTL_MODE_ATOMIC` with `TEST_ONLY | ALLOW_MODESET`) inside
`checkCrtcAssignment`'s recursive CRTC-to-connector search. The four distinct
CRTC IDs cycled through are 62, 81, 100, 119. The exact derivation of the 168
number from the recursion depth is not asserted here.
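
As an arithmetic aside, not a derivation: with four CRTCs and three connectors there are only 24 distinct one-to-one assignments, so the 168 attempts cannot all be distinct permutations, and 168 happens to equal 7 × 24. Whether KWin's search really walks the full set seven times is an assumption that has not been checked against the source. The sketch below only verifies the counting, using the CRTC IDs from the capture.

```python
from itertools import permutations

crtcs = [62, 81, 100, 119]  # CRTC IDs seen in the capture
connectors = ["PG27AQWP-W", "AW2725DF", "BenQ PD3200U"]

# Every one-to-one assignment of a distinct CRTC to each connector.
assignments = [
    dict(zip(connectors, combo))
    for combo in permutations(crtcs, len(connectors))
]

distinct = len(assignments)  # 4 * 3 * 2 ordered choices
observed = 168               # failed test commits in the strace capture
quotient, remainder = divmod(observed, distinct)
print(distinct, quotient, remainder)
```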

The patched diagnostic build prints the connector/CRTC/mode triple for every
pipeline in every failed attempt; that output is 504 lines (= 168 × 3) and is
in `journal-user.txt` in the attachment bundle. A representative window:

```
pipeline connector="PG27AQWP-W"    crtc=62  mode=2560x1440@540Hz  needsModeset=0
pipeline connector="AW2725DF"      crtc=81  mode=2560x1440@360Hz  needsModeset=0
pipeline connector="BenQ PD3200U"  crtc=100 mode=3840x2160@60Hz   needsModeset=0
Atomic modeset test failed! Invalid argument
pipeline connector="PG27AQWP-W"    crtc=62  mode=2560x1440@540Hz  needsModeset=0
pipeline connector="AW2725DF"      crtc=81  mode=2560x1440@360Hz  needsModeset=0
pipeline connector="BenQ PD3200U"  crtc=119 mode=3840x2160@60Hz   needsModeset=0
Atomic modeset test failed! Invalid argument
... 166 more attempts permuting BenQ across CRTCs 62, 81, 100, 119 ...
```

The BenQ PD3200U is requested at 3840x2160@60 Hz on every attempt. The other
two outputs (ASUS PG27AQWP-W at 2560x1440@540 Hz and Alienware AW2725DF at
2560x1440@360 Hz) keep their modes. KWin reshuffles CRTC assignments across 168
attempts; the kernel rejects each one.
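
The 504 = 168 × 3 relationship can be checked mechanically. The sketch below groups `pipeline ...` records into attempts, closing each attempt at the `Atomic modeset test failed!` line. It assumes every record sits on a single line with the connector, CRTC, mode and `needsModeset` fields in the order shown in the excerpt (mail-wrapped lines would need rejoining first); `group_attempts` is a hypothetical helper, not part of KWin.

```python
import re

PIPELINE_RE = re.compile(
    r'pipeline connector="(?P<connector>[^"]+)"\s+crtc=(?P<crtc>\d+)\s+'
    r'mode=(?P<mode>\S+)\s+needsModeset=(?P<needs_modeset>[01])'
)
FAIL_MARKER = "Atomic modeset test failed!"


def group_attempts(lines):
    """Return one list of (connector, crtc, mode) tuples per failed attempt."""
    attempts, current = [], []
    for line in lines:
        match = PIPELINE_RE.search(line)  # search() tolerates a journal prefix
        if match:
            current.append(
                (match["connector"], int(match["crtc"]), match["mode"])
            )
        elif FAIL_MARKER in line:
            attempts.append(current)
            current = []
    return attempts
```

Run over `journal-user.txt`, `len(attempts)` and `sum(len(a) for a in attempts)` should come out as 168 and 504 if the figures in this report hold, and `{crtc for a in attempts for _, crtc, _ in a}` gives the set of CRTC IDs that were tried.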

This narrows the question to nvidia-modeset's atomic-commit validation path for
`DRM_MODE_ATOMIC_ALLOW_MODESET` with three concurrent outputs when the BenQ's
4K60 timing is in the set. Single-output and page-flip commits validate fine;
the combined three-output modeset test is the only commit shape that rejects.

## Userspace context (per the KDE crash-report guide's KWin-specific section)

- Compositing: enabled. No `Backend=` override in `~/.config/kwinrc`; KWin
6.6.4 default applies (OpenGL).
- Effects: no `[Plugins]` section in `~/.config/kwinrc`; defaults apply for
Plasma 6.6.5.
- Decorations: Aurorae library (`org.kde.kwin.aurorae.v2`), theme
`__aurorae__svg__Fluent-round-dark`, not Breeze. Window decoration plumbing is
not part of the failing atomic property set, so this is probably irrelevant; it
is stated for the record.
- Drivers: nvidia-beta-dkms 595.71.05 (closed). The bug also reproduces with
nvidia-open-dkms 595.71.05, which shows the same EINVAL atomic-test failure plus
an additional EGLImage `GL_INVALID_OPERATION` cascade.
- Kernel: 7.0.9-zen2-1-zen
- KWin: 6.6.4 with a single local diagnostic patch at
`src/backends/drm/drm_pipeline.cpp:138-149` that emits `qCWarning(KWIN_DRM)`
per-pipeline state on `testAllowModeset()` failure. The patched binary is at
`~/kwin-src/install/bin/kwin_wayland`; `kwin-id.txt` records the path and `kwin
--version` output (`kwin 6.6.4`).

## Attachments

- `strace-ioctl.txt` — strace with full ioctl decode, captures the 168 EINVAL
events
- `atomic-decoded.txt` — Python-decoded human-readable summary of every DRM
atomic ioctl
- `journal-user.txt` — patched-build per-pipeline diagnostic output (504 lines)
- `journal-kernel.txt` — kernel side, includes SysRq +w / +t task dumps
- `dmesg.txt` — kernel ring buffer through the burst
- `gdb-bt-postwedge.txt` — symbolic `thread apply all bt full` captured shortly
after the failure cascade ended; seven KWin threads, post-failure idle state,
full source line resolution via debuginfod
- `wchan.txt` — per-thread kernel wait-channel
- `kwin-id.txt` — binary identity, version, command-line
- `modules-env.txt` — kernel cmdline, nvidia version, relevant lsmod entries
- `kmsg-burst.txt` — /dev/kmsg readout bracketing the failure window

Happy to capture anything else that would help — including `gdb` set up to
break on `drmModeAtomicCommit` if the property/value array of a failing call
would be useful, or an `ftrace` of the `drm:` event family during the burst.
