On Tue, Mar 17, 2026 at 4:24 AM Cristian Cocos <[email protected]> wrote:
>
> ## Summary
>
> RADV crashes with a `[gfxhub] Page fault at address: 0x0000000000000000` when 
> performing Vulkan rendering on an AMD RX 9060 XT (Navi 44, GFX1200). The 
> crash occurs ~20-30 seconds into video playback in mpv using `vo=gpu-next` 
> with `gpu-api=vulkan` (libplacebo). Multiple GPU rings (sdma0, gfx_0.0.0, 
> comp_1.x.x) time out simultaneously. The kernel driver recovers the rings, 
> but the Vulkan context is lost.
>
> **Critically, the crash also occurs when video decode is offloaded to VA-API 
> on a separate Intel iGPU** — only the Vulkan rendering path (libplacebo → 
> RADV → `vkQueueSubmit2`) is involved. This rules out 
> VK_KHR_video_decode_queue as the cause.
>

Please file a mesa ticket:
https://gitlab.freedesktop.org/mesa/mesa/-/issues
And include your full dmesg output from boot to when the issue happens.

Alex

> ## System Information
>
> | Component | Version |
> |-----------|---------|
> | GPU | AMD Radeon RX 9060 XT (Navi 44, RDNA 4, GFX1200) [1002:7590] (rev c0) |
> | Mesa | 26.0.2-1 (also reproduced on 26.0.1) |
> | vulkan-radeon | 26.0.2-1 |
> | libplacebo | v7.360.0 |
> | Kernel | 6.19.8-zen1-1-zen |
> | Firmware | linux-firmware-amdgpu 20260309-1 (SMC firmware 102.70.0) |
> | CPU | 13th Gen Intel Core i7-1360P |
> | Distro | blendOS (Arch-based, rolling) |
> | mpv | v0.41.0, FFmpeg n8.0.1 |
> | Connection | eGPU via Thunderbolt 4 (Razer Core X V2), PCIe 32 GT/s x16 link |
>
> ### Module parameters
>
> ```
> options amdgpu runpm=0 rebar=0 ppfeaturemask=0xFFFF7FFF
> ```
>
> - `runpm=0` — runtime PM disabled (TB eGPU SMU limitation)
> - `rebar=0` — BIOS assigns full 16 GB BAR, driver does not resize
> - `ppfeaturemask=0xFFFF7FFF` — GFXOFF disabled (bit 15) due to SMU IF version 
> mismatch (driver 0x2E vs firmware 0x33)
>
> **Note:** The SMU interface version mismatch (`smu_v14_0: SMU driver if 
> version not matched`) is a separate known issue. GFXOFF is disabled to 
> prevent a bus-loss crash, but the rendering crash described here is unrelated 
> — it occurs during active rendering, not during idle.
>
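
The `ppfeaturemask` note above can be checked with a few lines of Python. This is only a sketch: the bit position for GFXOFF (15) is taken from the report itself, and the helper names are hypothetical, not part of any amdgpu tooling.

```python
# Bit position for GFXOFF as stated in the report ("bit 15"); treat it as an
# assumption and confirm against the kernel headers for your version.
PP_GFXOFF_BIT = 15


def feature_enabled(mask: int, bit: int) -> bool:
    """Return True if the given feature bit is set in the mask."""
    return bool((mask >> bit) & 1)


def cleared_bits(mask: int, width: int = 32) -> list:
    """List every bit position below `width` that is cleared in the mask."""
    return [b for b in range(width) if not (mask >> b) & 1]


mask = 0xFFFF7FFF
print(feature_enabled(mask, PP_GFXOFF_BIT))  # False
print(cleared_bits(mask))                    # [15]
```

With the mask from the report, bit 15 is the only cleared bit in the low 32 bits, which matches the stated intent of disabling GFXOFF and nothing else.
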
> ## Steps to Reproduce
>
> 1. Install an AMD RX 9060 XT (Navi 44)
> 2. Configure mpv with Vulkan rendering:
> ```
> vo=gpu-next
> gpu-api=vulkan
> gpu-context=waylandvk
> vulkan-device='AMD Radeon RX 9060 XT (RADV GFX1200)'
> vulkan-async-compute=yes
> vulkan-async-transfer=yes
> ```
> 3. Play any video file: `mpv /path/to/video.mkv`
> 4. Wait ~20-30 seconds
>
> ### Test 1: Vulkan decode + Vulkan rendering (`hwdec=vulkan`)
>
> Crashes after ~26 seconds.
>
> ### Test 2: VA-API decode (Intel iGPU) + Vulkan rendering (`hwdec=vaapi`)
>
> **Also crashes after ~26 seconds.** VA-API decode runs on the Intel iGPU 
> (`iHD_drv_video.so`), only Vulkan rendering runs on the AMD GPU via RADV. 
> This isolates the bug to the RADV rendering path.
>
> ## RADV Error Output
>
> ```
> radv/amdgpu: The CS has been cancelled because the context is lost.
> This context is guilty of a hard recovery.
>
> [vo/gpu-next/libplacebo] vkQueueSubmit2: VK_ERROR_DEVICE_LOST (../src/vulkan/command.c:514)
> [vo/gpu-next/libplacebo] Retrieving query pool results: VK_ERROR_DEVICE_LOST (../src/vulkan/gpu.c:105)
> [vo/gpu-next/libplacebo] Failed holding swapchain image for presentation
> [vo/gpu-next] Failed presenting frame!
> [ffmpeg] vk: Unable to submit command buffer: VK_ERROR_DEVICE_LOST
> [ffmpeg/video] h264: hardware accelerator failed to decode picture
> ```
>
> ## Kernel Log (Crash 1 — hwdec=vulkan, Mesa 26.0.2)
>
> ```
> amdgpu 0000:06:00.0: amdgpu: Dumping IP State
> amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
> amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=11425, emitted seq=11427
> amdgpu 0000:06:00.0: amdgpu: Starting sdma0 ring reset
> amdgpu 0000:06:00.0: amdgpu: Ring sdma0 reset succeeded
> amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
> amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=16289, emitted seq=16291
> amdgpu 0000:06:00.0: amdgpu: Process mpv pid 44985 thread vo pid 45004
> amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
> amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
> amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 timeout, signaled seq=13, emitted seq=14
> amdgpu 0000:06:00.0: amdgpu: Process mpv pid 44985 thread vo pid 45004
> amdgpu 0000:06:00.0: amdgpu: Ring comp_1.1.0 reset succeeded
> amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
> amdgpu 0000:06:00.0: amdgpu: Fence fallback timer expired on ring sdma1
> amdgpu 0000:06:00.0: [drm] *ERROR* [CRTC:416:crtc-0] flip_done timed out
> ```
>
> ## Kernel Log (Crash 2 — hwdec=vaapi, Mesa 26.0.2)
>
> ```
> amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=13615, emitted seq=13617
> amdgpu 0000:06:00.0: amdgpu: Ring sdma0 reset succeeded
> amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
> amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=30731, emitted seq=30733
> amdgpu 0000:06:00.0: amdgpu: Process mpv pid 66481 thread vo pid 66500
> amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
> amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
> amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=312, emitted seq=313
> ```
>
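
For comparing the two crashes, the ring timeout lines can be summarised mechanically. The following is a minimal sketch, assuming the log format is exactly as quoted above; `ring_timeouts` and `SAMPLE` are illustrative names, and the parser tolerates both the `> ` quoting and lines that a mail client has wrapped.

```python
import re

# Matches the "ring <name> timeout, signaled seq=N, emitted seq=M" lines
# as they appear in the kernel logs quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+),"
    r"\s+emitted seq=(?P<emitted>\d+)"
)


def ring_timeouts(log: str):
    """Return (ring, signaled, emitted, pending) for each timeout in the log."""
    # Drop mail quoting ("> ") and rejoin lines wrapped by the mail client.
    text = " ".join(line.lstrip("> ").strip() for line in log.splitlines())
    return [
        (m["ring"], int(m["signaled"]), int(m["emitted"]),
         int(m["emitted"]) - int(m["signaled"]))
        for m in TIMEOUT_RE.finditer(text)
    ]


# Excerpt from Crash 2, including one mail-wrapped line.
SAMPLE = """\
> amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=13615, emitted
> seq=13617
> amdgpu 0000:06:00.0: amdgpu: Ring sdma0 reset succeeded
> amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=30731, emitted seq=30733
> amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=312, emitted seq=313
"""

for ring, signaled, emitted, pending in ring_timeouts(SAMPLE):
    print(f"{ring}: {pending} job(s) pending (signaled={signaled}, emitted={emitted})")
```

In both crashes the gap between emitted and signaled sequence numbers is only one or two jobs per ring, so the hang affects the most recent submissions rather than a long backlog.
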
> ## GPU Device Coredump (Crash 1)
>
> ```
> **** AMDGPU Device Coredump ****
> version: 1
> kernel: 6.19.8-zen1-1-zen
> module: amdgpu
> time: 3054.167340782
>
> SOC Device id: 30096
> SOC Family: 152
> SOC External Revision id: 65
>
> HWIP: GC[1][0]: v12.0.0.0.0
> HWIP: SDMA0[3][0]: v7.0.0.0.0
> HWIP: MMHUB[12][0]: v4.1.0.0.0
>
> Ring timed out details
> IP Type: 2 Ring Name: sdma0
>
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x0
> ```
>
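
The coredump prints the SOC device id in decimal, while the System Information table gives the PCI ID in hex. A quick sanity check that the dump belongs to the same device, assuming (as the numbers suggest) that the `SOC Device id` field is the PCI device ID; the function name is hypothetical:

```python
def pci_device_id(soc_device_id: int) -> str:
    """Format a decimal SOC device id as a 4-digit lowercase hex PCI device ID."""
    return f"{soc_device_id:04x}"


print(pci_device_id(30096))  # 7590, matching [1002:7590] in the table above
```
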
> **Full coredump available on request** (543 KB).
>
> ## Analysis
>
> - The crash is a **NULL pointer dereference at GPU virtual address 0x0** — 
> RADV is submitting commands that reference unmapped memory.
> - The `Protection fault status register: 0x0` suggests the fault info itself 
> is zeroed, which may indicate the fault occurred very early in command 
> processing or in an SDMA copy from a NULL source.
> - The fault hits sdma0 first, then cascades to gfx_0.0.0 and a compute ring — 
> consistent with a resource upload (SDMA) referencing a NULL buffer, followed 
> by the GFX/compute rings trying to use the result.
> - After the ring resets, the GPU fully recovers (all fences drain, PCIe link 
> stays up at 32 GT/s x16), which points to a userspace (RADV) command 
> stream issue rather than a hardware or kernel driver bug.
> - The `flip_done timed out` on CRTC-0 is a secondary effect — the 
> compositor's page flip can't complete while rings are being reset, which 
> restarts the GNOME session.
>
> ## Additional Notes
>
> - The GPU is connected via Thunderbolt 4 (eGPU enclosure), but the PCIe link 
> stays healthy through the crash — this is not a link/BAR issue.
> - This was also reproduced on Mesa 26.0.1 with kernel 6.19.6 and firmware 
> 20260221 (SMC 102.69.0) — same crash signature.
> - Desktop compositing (GNOME Shell / Mutter on Wayland) works fine on this 
> GPU — only mpv's libplacebo rendering pipeline triggers the crash.
> - `vulkan-async-compute=yes` was enabled. Not yet tested with async compute 
> disabled, though the fault is on sdma0, not a compute ring.
