## Summary

RADV crashes with a `[gfxhub]` page fault at address `0x0000000000000000` when performing Vulkan rendering on an AMD RX 9060 XT (Navi 44, GFX1200). The crash occurs ~20-30 seconds into video playback in mpv using `vo=gpu-next` with `gpu-api=vulkan` (libplacebo). Multiple GPU rings (sdma0, gfx_0.0.0, comp_1.x.x) time out in quick succession. The kernel driver recovers the rings, but the Vulkan context is lost.

**Critically, the crash also occurs when video decode is offloaded to VA-API on a separate Intel iGPU.** In that configuration only the Vulkan rendering path (libplacebo → RADV → `vkQueueSubmit2`) runs on the AMD GPU, which rules out VK_KHR_video_decode_queue as the cause.

## System Information

| Component | Version |
|-----------|---------|
| GPU | AMD Radeon RX 9060 XT (Navi 44, RDNA 4, GFX1200) [1002:7590] (rev c0) |
| Mesa | 26.0.2-1 (also reproduced on 26.0.1) |
| vulkan-radeon | 26.0.2-1 |
| libplacebo | v7.360.0 |
| Kernel | 6.19.8-zen1-1-zen |
| Firmware | linux-firmware-amdgpu 20260309-1 (SMC firmware 102.70.0) |
| CPU | 13th Gen Intel Core i7-1360P |
| Distro | blendOS (Arch-based, rolling) |
| mpv | v0.41.0, FFmpeg n8.0.1 |
| Connection | eGPU via Thunderbolt 4 (Razer Core X V2), PCIe 32 GT/s x16 link |

### Module parameters

```
options amdgpu runpm=0 rebar=0 ppfeaturemask=0xFFFF7FFF
```

- `runpm=0`: runtime PM disabled (TB eGPU SMU limitation)
- `rebar=0`: BIOS assigns full 16 GB BAR, driver does not resize
- `ppfeaturemask=0xFFFF7FFF`: GFXOFF disabled (bit 15) due to SMU IF version mismatch (driver 0x2E vs firmware 0x33)

**Note:** The SMU interface version mismatch (`smu_v14_0: SMU driver if version not matched`) is a separate known issue. GFXOFF is disabled to prevent a bus-loss crash, but the rendering crash described here is unrelated: it occurs during active rendering, not during idle.

## Steps to Reproduce

1. Install an AMD RX 9060 XT (Navi 44).
2. Configure mpv with Vulkan rendering:
   ```
   vo=gpu-next
   gpu-api=vulkan
   gpu-context=waylandvk
   vulkan-device='AMD Radeon RX 9060 XT (RADV GFX1200)'
   vulkan-async-compute=yes
   vulkan-async-transfer=yes
   ```
3. Play any video file: `mpv /path/to/video.mkv`
4. Wait ~20-30 seconds.

### Test 1: Vulkan decode + Vulkan rendering (`hwdec=vulkan`)

Crashes after ~26 seconds.

### Test 2: VA-API decode (Intel iGPU) + Vulkan rendering (`hwdec=vaapi`)

**Also crashes after ~26 seconds.** VA-API decode runs on the Intel iGPU (`iHD_drv_video.so`); only Vulkan rendering runs on the AMD GPU via RADV. This isolates the bug to the RADV rendering path.

## RADV Error Output

```
radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a hard recovery.
[vo/gpu-next/libplacebo] vkQueueSubmit2: VK_ERROR_DEVICE_LOST (../src/vulkan/command.c:514)
[vo/gpu-next/libplacebo] Retrieving query pool results: VK_ERROR_DEVICE_LOST (../src/vulkan/gpu.c:105)
[vo/gpu-next/libplacebo] Failed holding swapchain image for presentation
[vo/gpu-next] Failed presenting frame!
[ffmpeg] vk: Unable to submit command buffer: VK_ERROR_DEVICE_LOST
[ffmpeg/video] h264: hardware accelerator failed to decode picture
```

## Kernel Log (Crash 1 — hwdec=vulkan, Mesa 26.0.2)

```
amdgpu 0000:06:00.0: amdgpu: Dumping IP State
amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=11425, emitted seq=11427
amdgpu 0000:06:00.0: amdgpu: Starting sdma0 ring reset
amdgpu 0000:06:00.0: amdgpu: Ring sdma0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=16289, emitted seq=16291
amdgpu 0000:06:00.0: amdgpu: Process mpv pid 44985 thread vo pid 45004
amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 timeout, signaled seq=13, emitted seq=14
amdgpu 0000:06:00.0: amdgpu: Process mpv pid 44985 thread vo pid 45004
amdgpu 0000:06:00.0: amdgpu: Ring comp_1.1.0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: Fence fallback timer expired on ring sdma1
amdgpu 0000:06:00.0: [drm] *ERROR* [CRTC:416:crtc-0] flip_done timed out
```

## Kernel Log (Crash 2 — hwdec=vaapi, Mesa 26.0.2)

```
amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=13615, emitted seq=13617
amdgpu 0000:06:00.0: amdgpu: Ring sdma0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=30731, emitted seq=30733
amdgpu 0000:06:00.0: amdgpu: Process mpv pid 66481 thread vo pid 66500
amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=312, emitted seq=313
```

## GPU Device Coredump (Crash 1)

```
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.19.8-zen1-1-zen
module: amdgpu
time: 3054.167340782
SOC Device id: 30096
SOC Family: 152
SOC External Revision id: 65
HWIP: GC[1][0]: v12.0.0.0.0
HWIP: SDMA0[3][0]: v7.0.0.0.0
HWIP: MMHUB[12][0]: v4.1.0.0.0
Ring timed out details
IP Type: 2 Ring Name: sdma0
[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x0
```

**Full coredump available on request** (543 KB).

## Analysis

- The crash looks like a **NULL pointer dereference at GPU virtual address 0x0**: a command submitted through RADV references unmapped memory.
- The `Protection fault status register: 0x0` suggests the fault info itself is zeroed, which may indicate that the fault occurred very early in command processing or in an SDMA copy from a NULL source.
- The fault hits sdma0 first, then cascades to gfx_0.0.0 and a compute ring. This is consistent with a resource upload (SDMA) referencing a NULL buffer, followed by the GFX/compute rings trying to use the result.
- After the ring resets, the GPU fully recovers (all fences drain, the PCIe link stays up at 32 GT/s x16), which points to a userspace (RADV) command stream issue rather than a hardware or kernel driver bug.
- The `flip_done timed out` on CRTC-0 is a secondary effect: the compositor's page flip cannot complete while the rings are being reset, and the GNOME session restarts as a result.

## Additional Notes

- The GPU is connected via Thunderbolt 4 (eGPU enclosure), but the PCIe link stays healthy through the crash, so this is not a link/BAR issue.
- This was also reproduced on Mesa 26.0.1 with kernel 6.19.6 and firmware 20260221 (SMC 102.69.0), with the same crash signature.
- Desktop compositing (GNOME Shell / Mutter on Wayland) works fine on this GPU; only mpv's libplacebo rendering pipeline triggers the crash.
- `vulkan-async-compute=yes` was enabled. The crash has not yet been tested with async compute disabled, although the fault is reported on sdma0, not on a compute ring.
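
To make the ring timeouts in the two kernel logs easier to compare, the `ring ... timeout` lines can be summarized with a small script. The following is a minimal sketch, not part of any amdgpu tooling: the function name `ring_timeouts` and the reading of `emitted - signaled` as the number of fences still outstanding on that ring are assumptions made for illustration.

```python
import re

# Matches kernel log lines such as:
#   amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=11425, emitted seq=11427
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def ring_timeouts(log_text):
    """Return (ring, signaled, emitted, pending) tuples in log order.

    `pending` is emitted - signaled, read here (as an assumption) as the
    number of fences that had been emitted but not yet signaled.
    """
    results = []
    for match in RING_TIMEOUT.finditer(log_text):
        signaled = int(match["signaled"])
        emitted = int(match["emitted"])
        results.append((match["ring"], signaled, emitted, emitted - signaled))
    return results


# Timeout lines copied from the Crash 1 kernel log in this report.
SAMPLE = """\
amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=11425, emitted seq=11427
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=16289, emitted seq=16291
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 timeout, signaled seq=13, emitted seq=14
"""

if __name__ == "__main__":
    for ring, signaled, emitted, pending in ring_timeouts(SAMPLE):
        print(f"{ring}: {pending} pending (signaled={signaled}, emitted={emitted})")
```

Feeding it the output of `journalctl -k` for each crash gives one row per timed-out ring, in the order the timeouts were logged, which makes the sdma0 → gfx_0.0.0 → compute sequence easy to check across runs.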

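
The `ppfeaturemask=0xFFFF7FFF` value from the module parameters can be sanity-checked with a few lines of arithmetic. This sketch takes the report's statement that GFXOFF is bit 15 at face value; the helper names `bit_is_set` and `cleared_bits` are hypothetical and exist only for this example.

```python
PPFEATUREMASK = 0xFFFF7FFF  # value from the modprobe options in this report
GFXOFF_BIT = 15             # bit position as stated in this report


def bit_is_set(mask, bit):
    """Return True if the given bit position is set in mask."""
    return bool((mask >> bit) & 1)


def cleared_bits(mask, width=32):
    """List every bit position below `width` that is cleared in mask."""
    return [bit for bit in range(width) if not bit_is_set(mask, bit)]


if __name__ == "__main__":
    print("GFXOFF bit set:", bit_is_set(PPFEATUREMASK, GFXOFF_BIT))
    print("cleared bits:", cleared_bits(PPFEATUREMASK))
```

If the mask is as intended, bit 15 is the only cleared bit in the 32-bit value, so no other power-play feature is disabled by this setting.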