## Summary

Vulkan rendering through RADV on an AMD RX 9060 XT (Navi 44, GFX1200)
triggers a `[gfxhub]` page fault at address `0x0000000000000000`. The
fault occurs ~20-30 seconds into video playback in mpv using
`vo=gpu-next` with `gpu-api=vulkan` (libplacebo).
Multiple GPU rings (sdma0, gfx_0.0.0, comp_1.x.x) time out
simultaneously. The kernel driver recovers the rings, but the Vulkan
context is lost.

**Critically, the crash also occurs when video decode is offloaded to
VA-API on a separate Intel iGPU** — only the Vulkan rendering path
(libplacebo → RADV → `vkQueueSubmit2`) is involved. This rules out
VK_KHR_video_decode_queue as the cause.

## System Information

| Component | Version |
|-----------|---------|
| GPU | AMD Radeon RX 9060 XT (Navi 44, RDNA 4, GFX1200) [1002:7590] (rev c0) |
| Mesa | 26.0.2-1 (also reproduced on 26.0.1) |
| vulkan-radeon | 26.0.2-1 |
| libplacebo | v7.360.0 |
| Kernel | 6.19.8-zen1-1-zen |
| Firmware | linux-firmware-amdgpu 20260309-1 (SMC firmware 102.70.0) |
| CPU | 13th Gen Intel Core i7-1360P |
| Distro | blendOS (Arch-based, rolling) |
| mpv | v0.41.0, FFmpeg n8.0.1 |
| Connection | eGPU via Thunderbolt 4 (Razer Core X V2), PCIe 32 GT/s x16 link |

### Module parameters

```
options amdgpu runpm=0 rebar=0 ppfeaturemask=0xFFFF7FFF
```

- `runpm=0` — runtime PM disabled (TB eGPU SMU limitation)
- `rebar=0` — BIOS assigns full 16 GB BAR, driver does not resize
- `ppfeaturemask=0xFFFF7FFF` — GFXOFF disabled (bit 15) due to SMU IF
version mismatch (driver 0x2E vs firmware 0x33)

**Note:** The SMU interface version mismatch (`smu_v14_0: SMU driver if
version not matched`) is a separate known issue. GFXOFF is disabled to
prevent a bus-loss crash, but the rendering crash described here is
unrelated — it occurs during active rendering, not during idle.
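
As a quick sanity check on the `ppfeaturemask` value above, the following minimal sketch lists which bits are cleared in a 32-bit mask. It assumes that a cleared bit disables the corresponding feature and takes the report's statement that bit 15 is GFXOFF at face value; the helper name `cleared_bits` is made up for illustration.

```python
def cleared_bits(mask: int) -> list[int]:
    """Return the bit positions that are 0 in a 32-bit feature mask."""
    return [bit for bit in range(32) if not (mask >> bit) & 1]


if __name__ == "__main__":
    # Value taken from the modprobe options in this report.
    print(cleared_bits(0xFFFF7FFF))  # [15]
```

Only bit 15 is cleared, so every other feature bit in the mask remains enabled.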

## Steps to Reproduce

1. Install an AMD RX 9060 XT (Navi 44)
2. Configure mpv with Vulkan rendering:
```
vo=gpu-next
gpu-api=vulkan
gpu-context=waylandvk
vulkan-device='AMD Radeon RX 9060 XT (RADV GFX1200)'
vulkan-async-compute=yes
vulkan-async-transfer=yes
```
3. Play any video file: `mpv /path/to/video.mkv`
4. Wait ~20-30 seconds

### Test 1: Vulkan decode + Vulkan rendering (`hwdec=vulkan`)

Crashes after ~26 seconds.

### Test 2: VA-API decode (Intel iGPU) + Vulkan rendering (`hwdec=vaapi`)

**Also crashes after ~26 seconds.** VA-API decode runs on the Intel
iGPU (`iHD_drv_video.so`), only Vulkan rendering runs on the AMD GPU
via RADV. This isolates the bug to the RADV rendering path.

## RADV Error Output

```
radv/amdgpu: The CS has been cancelled because the context is lost.
This context is guilty of a hard recovery.

[vo/gpu-next/libplacebo] vkQueueSubmit2: VK_ERROR_DEVICE_LOST (../src/vulkan/command.c:514)
[vo/gpu-next/libplacebo] Retrieving query pool results: VK_ERROR_DEVICE_LOST (../src/vulkan/gpu.c:105)
[vo/gpu-next/libplacebo] Failed holding swapchain image for presentation
[vo/gpu-next] Failed presenting frame!
[ffmpeg] vk: Unable to submit command buffer: VK_ERROR_DEVICE_LOST
[ffmpeg/video] h264: hardware accelerator failed to decode picture
```

## Kernel Log (Crash 1 — hwdec=vulkan, Mesa 26.0.2)

```
amdgpu 0000:06:00.0: amdgpu: Dumping IP State
amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=11425, emitted seq=11427
amdgpu 0000:06:00.0: amdgpu: Starting sdma0 ring reset
amdgpu 0000:06:00.0: amdgpu: Ring sdma0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=16289, emitted seq=16291
amdgpu 0000:06:00.0: amdgpu: Process mpv pid 44985 thread vo pid 45004
amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 timeout, signaled seq=13, emitted seq=14
amdgpu 0000:06:00.0: amdgpu: Process mpv pid 44985 thread vo pid 45004
amdgpu 0000:06:00.0: amdgpu: Ring comp_1.1.0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: Fence fallback timer expired on ring sdma1
amdgpu 0000:06:00.0: [drm] *ERROR* [CRTC:416:crtc-0] flip_done timed out
```

## Kernel Log (Crash 2 — hwdec=vaapi, Mesa 26.0.2)

```
amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=13615, emitted seq=13617
amdgpu 0000:06:00.0: amdgpu: Ring sdma0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=30731, emitted seq=30733
amdgpu 0000:06:00.0: amdgpu: Process mpv pid 66481 thread vo pid 66500
amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded
amdgpu 0000:06:00.0: [drm] device wedged, but recovered through reset
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=312, emitted seq=313
```
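
To compare the two crashes, the ring timeout lines can be tabulated with a short script. This is a minimal sketch, not part of any amdgpu tooling: the regular expression is written against the message format seen in the logs above, and the function name `ring_timeouts` is hypothetical. It reports how many fences were still pending (emitted minus signaled) on each ring.

```python
import re

# Matches e.g. "ring sdma0 timeout, signaled seq=11425, emitted seq=11427".
# \s+ is used between tokens so that hard-wrapped log lines still match.
RING_TIMEOUT = re.compile(
    r"ring\s+(?P<ring>\S+)\s+timeout,\s+signaled\s+seq=(?P<signaled>\d+),"
    r"\s+emitted\s+seq=(?P<emitted>\d+)"
)


def ring_timeouts(log_text: str) -> list[tuple[str, int, int, int]]:
    """Return (ring, signaled, emitted, pending) for each timeout found."""
    rows = []
    for m in RING_TIMEOUT.finditer(log_text):
        signaled = int(m["signaled"])
        emitted = int(m["emitted"])
        rows.append((m["ring"], signaled, emitted, emitted - signaled))
    return rows


if __name__ == "__main__":
    # Excerpt from the Crash 1 log, including a wrapped line.
    sample = """\
amdgpu 0000:06:00.0: amdgpu: ring sdma0 timeout, signaled seq=11425,
emitted seq=11427
amdgpu 0000:06:00.0: amdgpu: Ring sdma0 reset succeeded
amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 timeout, signaled seq=13, emitted seq=14
"""
    for ring, signaled, emitted, pending in ring_timeouts(sample):
        print(f"{ring}: signaled={signaled} emitted={emitted} pending={pending}")
```

In both crashes sdma0 and gfx_0.0.0 each have two fences pending and the compute ring has one, so the two signatures match even though the decode path differs.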

## GPU Device Coredump (Crash 1)

```
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.19.8-zen1-1-zen
module: amdgpu
time: 3054.167340782

SOC Device id: 30096
SOC Family: 152
SOC External Revision id: 65

HWIP: GC[1][0]: v12.0.0.0.0
HWIP: SDMA0[3][0]: v7.0.0.0.0
HWIP: MMHUB[12][0]: v4.1.0.0.0

Ring timed out details
IP Type: 2 Ring Name: sdma0

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x0
```
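
The coredump can be cross-checked against the System Information table. The sketch below assumes that `SOC Device id` is the PCI device ID printed in decimal, which is an inference from the numbers lining up rather than something the dump states; `soc_device_id_hex` is an illustrative helper name.

```python
def soc_device_id_hex(decimal_id: int) -> str:
    """Format a decimal device id as a 4-digit lowercase hex string."""
    return f"{decimal_id:04x}"


if __name__ == "__main__":
    pci_id = "1002:7590"       # vendor:device from the System Information table
    device = pci_id.split(":")[1]
    coredump_id = 30096        # "SOC Device id" from the coredump
    print(soc_device_id_hex(coredump_id) == device)  # True
```

The match indicates the dump was produced by the RX 9060 XT at `0000:06:00.0` and not by another device.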

**Full coredump available on request** (543 KB).

## Analysis

- The fault is an **access to GPU virtual address 0x0**, which suggests
that a command stream submitted by RADV references a NULL or unmapped
buffer.
- The `Protection fault status register: 0x0` suggests the fault info
itself is zeroed, which may indicate the fault occurred very early in
command processing or in an SDMA copy from a NULL source.
- The fault hits sdma0 first, then cascades to gfx_0.0.0 and a compute
ring — consistent with a resource upload (SDMA) referencing a NULL
buffer, followed by the GFX/compute rings trying to use the result.
- After ring resets, the GPU fully recovers (all fences drain, PCIe
link stays up at 32 GT/s x16). This points to a userspace (RADV)
command stream issue rather than a hardware or kernel driver bug,
although the recovery alone does not prove it.
- The `flip_done timed out` on CRTC-0 is a secondary effect — the
compositor's page flip can't complete while rings are being reset,
which restarts the GNOME session.

## Additional Notes

- The GPU is connected via Thunderbolt 4 (eGPU enclosure), but the PCIe
link stays healthy through the crash — this is not a link/BAR issue.
- This was also reproduced on Mesa 26.0.1 with kernel 6.19.6 and
firmware 20260221 (SMC 102.69.0) — same crash signature.
- Desktop compositing (GNOME Shell / Mutter on Wayland) works fine on
this GPU — only mpv's libplacebo rendering pipeline triggers the crash.
- `vulkan-async-compute=yes` was enabled. Not yet tested with async
compute disabled, though the fault is on sdma0, not a compute ring.
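
The untested combinations could be covered with a small test matrix. The sketch below only builds the mpv command lines. It assumes the settings from the Steps to Reproduce config are passed as command-line options instead, and the helper name `mpv_command` and the video path are placeholders.

```python
import itertools
import shlex

# Rendering options from the reproduction config, as command-line flags.
BASE = ["mpv", "--vo=gpu-next", "--gpu-api=vulkan", "--gpu-context=waylandvk"]


def mpv_command(video: str, async_compute: bool, async_transfer: bool,
                hwdec: str) -> list[str]:
    """Build an mpv argv list for one async-compute/transfer combination."""
    def yes_no(flag: bool) -> str:
        return "yes" if flag else "no"

    return BASE + [
        f"--hwdec={hwdec}",
        f"--vulkan-async-compute={yes_no(async_compute)}",
        f"--vulkan-async-transfer={yes_no(async_transfer)}",
        video,
    ]


if __name__ == "__main__":
    # Print all four combinations; run each one and note whether it crashes.
    for compute, transfer in itertools.product([True, False], repeat=2):
        print(shlex.join(mpv_command("/path/to/video.mkv", compute, transfer, "vaapi")))
```

If the crash disappears with `--vulkan-async-transfer=no`, that would be a hint that the dedicated transfer queue is involved, which would fit the sdma0 timeout appearing first.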
