https://bugs.kde.org/show_bug.cgi?id=517436

            Bug ID: 517436
           Summary: KWIN_USE_OVERLAYS=1 on RDNA4 causing memory errors in
                    compilers and stress tests
    Classification: Plasma
           Product: kwin
Version First Reported In: 6.6.2
          Platform: CachyOS
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: general
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: ---

Created attachment 190584
  --> https://bugs.kde.org/attachment.cgi?id=190584&action=edit
GSAT (stressapptest run with -W) logs

SUMMARY
KWIN_USE_OVERLAYS=1 causes CPU computation corruption on a system with an AMD
RDNA 4 GPU (RX 9070 XT): mprime and stressapptest fail, and GCC, rustc, and
LLVM hit sporadic, transient compilation errors.

STEPS TO REPRODUCE
1. Set KWIN_USE_OVERLAYS=1 as an environment variable for the KWin Wayland
session (e.g. in /etc/environment)
2. Log in to a Plasma Wayland session
3. Run mprime -t (Prime95 torture test) or stressapptest -W

OBSERVED RESULT
mprime immediately and consistently fails with:

FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 768K FFT size

stressapptest fails within seconds with memory miscompares across multiple CPUs
and test patterns (full log attached). The run reported 70 hardware incidents
in 16 seconds, hitting CPUs 1, 3, 5, 9, 10, 11, and 12 across all test patterns
(Long8b10b, FiveA, walkingInvOnes, OneZero, JustOne, Checker8b10b,
walkingZeros).

Compilation tasks also produce sporadic internal compiler errors (ICEs) and
miscompilations, observed while building mesa and during rustc/LLVM
compilation.

EXPECTED RESULT
CPU stress tests should pass. The CPU and RAM are verified healthy. mprime and
stressapptest run indefinitely without errors in all of the following
scenarios:
- KWIN_USE_OVERLAYS unset or =0 with the same Plasma Wayland session: passes
- Compositor stopped (systemctl stop plasmalogin) with amdgpu driver still
loaded: passes
- Booted to multi-user.target (no graphical session): passes
- EndeavourOS and CachyOS live USBs: passes

The only condition that triggers the error on my system is having KWin's
hardware overlay planes active. Only tested on Wayland, not X11.

SOFTWARE/OS VERSIONS
- OS: CachyOS (Arch-based)
- Kernel: 7.0.0-rc3-1-cachyos-rc (also reproduced on 6.19.6-2-cachyos and 6.19
from Arch core)
- KDE Plasma Version: 6.6.2
- KDE Frameworks Version: 6.23.0
- Qt Version: 6.10.2
- KWin: 6.6.2-1.1
- Kernel boot line: root=UUID=... rw rootflags=subvol=/@ nowatchdog
nmi_watchdog=0

ADDITIONAL INFORMATION
Hardware:
- CPU: AMD Ryzen 7 7700X (Zen 4)
- GPU: AMD Radeon RX 9070 XT (RDNA 4, gfx1201, PCI ID 1002:7550)
- RAM: 32 GB DDR5
- GPU driver: amdgpu (radeonsi, DRM 3.64)
- Full PCIe ReBAR enabled (16384 MB BAR)

Isolation methodology:
Three environment variables were active when the issue was found:
AMD_USERQ=1
KWIN_DRM_USE_MODIFIERS=1
KWIN_USE_OVERLAYS=1
Disabling all three resolved the issue. Re-enabling them one at a time
identified KWIN_USE_OVERLAYS=1 as the sole cause. The other two can be active
without issue so far.
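
The one-at-a-time isolation described above can be written down as a small
test plan. This is a minimal sketch, not the exact procedure used: it assumes
each variable was enabled on its own with the other two left unset, and
isolation_plan and environment_lines are hypothetical helper names chosen for
illustration.

```python
# Flags named in this report. The plan below is an illustrative assumption:
# one baseline run with nothing set, then one run per flag enabled alone.
FLAGS = ("AMD_USERQ", "KWIN_DRM_USE_MODIFIERS", "KWIN_USE_OVERLAYS")


def isolation_plan(flags):
    """Return the set of enabled flags for each run: none, then each alone."""
    return [frozenset()] + [frozenset([name]) for name in flags]


def environment_lines(enabled):
    """Render one run as NAME=1 lines, e.g. for /etc/environment."""
    return [f"{name}=1" for name in sorted(enabled)]


if __name__ == "__main__":
    for run, enabled in enumerate(isolation_plan(FLAGS)):
        lines = environment_lines(enabled)
        print(f"run {run}: {' '.join(lines) if lines else '(none set)'}")
```

Each run would be followed by a fresh login and a stress test (mprime -t or
stressapptest -W); only the run with KWIN_USE_OVERLAYS=1 failed here.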

Corruption pattern:
Memory reads back as all-zeros or stale data (0x0000000f0000ffff,
0x0000ffff0000000f) instead of the expected test patterns. The corrupted values
are consistent across re-reads, suggesting systematic corruption rather than
transient bit flips.
This pattern appears across all 8 cores/16 threads simultaneously.
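
One property of the two values quoted above can be checked mechanically: each
is the other with its upper and lower 32-bit halves exchanged. The sketch
below only verifies that arithmetic relationship; it is an observation about
the logged values, not a diagnosis, and swap_halves is a hypothetical helper
name.

```python
# The two stale 64-bit values quoted in this report.
OBSERVED = (0x0000000F0000FFFF, 0x0000FFFF0000000F)


def swap_halves(value):
    """Swap the upper and lower 32-bit halves of a 64-bit value."""
    low = value & 0xFFFFFFFF
    high = (value >> 32) & 0xFFFFFFFF
    return (low << 32) | high


first, second = OBSERVED
print(swap_halves(first) == second)  # True
```

Whether this relationship is meaningful or a coincidence of the test data is
not known from the logs alone.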

Note: This may ultimately be an amdgpu kernel driver issue, possibly related to
RDNA4 architecture, triggered by KWin's overlay plane usage rather than a KWin
bug itself. Filing here since KWin's KWIN_USE_OVERLAYS is the trigger, but
happy to re-file against the kernel driver if that's more appropriate.
