Hello amd-gfx,

I am reporting a reproducible amdgpu failure on gfx1150 (Strix / Radeon 
880M/890M) observed on Linux 6.19-rc2. The issue appears to be a real GPU VM / 
illegal access fault that reliably escalates into an unrecoverable reset.

This is not related to ROCm or user compute workloads.

---

Hardware:

* APU: AMD Strix (gfx1150)
* GPU: Radeon 880M / 890M (integrated)
* SMU: smu_v14_0_0
* Platform: x86_64 desktop
* Firmware: standard linux-firmware (no custom blobs)

Kernel:

* Linux 6.19.0-rc2
* amdgpu built as module
* DRM AMD DC enabled
* Default kernel configuration for modern AMD APU (no unusual options)

---

Observed failure (6.19-rc2):

During a long-running but otherwise normal graphics/compute workload, the 
kernel logs the following:

```
amdgpu 0000:c5:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
amdgpu 0000:c5:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
amdgpu 0000:c5:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
amdgpu 0000:c5:00.0: amdgpu:          Faulty UTCL2 client ID: CPC (0x5)
amdgpu 0000:c5:00.0: amdgpu:          WALKER_ERROR: 0x1
amdgpu 0000:c5:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
amdgpu 0000:c5:00.0: amdgpu:          MAPPING_ERROR: 0x1
```
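
For reference, the raw status value decodes consistently with the fields the driver printed. The following is a minimal Python sketch. The bit layout (MORE_FAULTS at bit 0, WALKER_ERROR at bits 1-3, PERMISSION_FAULTS at bits 4-7, MAPPING_ERROR at bit 8, client ID at bits 9-17, RW at bit 18) is an assumption inferred from the logged values, not copied from the register headers, and `decode_fault_status` is a hypothetical helper name.

```python
# Assumed field layout for GCVM_L2_PROTECTION_FAULT_STATUS: name -> (shift, mask).
# Inferred from the log excerpt above, not taken from the register headers.
FIELDS = {
    "MORE_FAULTS": (0, 0x1),
    "WALKER_ERROR": (1, 0x7),
    "PERMISSION_FAULTS": (4, 0xF),
    "MAPPING_ERROR": (8, 0x1),
    "CID": (9, 0x1FF),
    "RW": (18, 0x1),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into the assumed bitfields."""
    return {name: (status >> shift) & mask for name, (shift, mask) in FIELDS.items()}


if __name__ == "__main__":
    # Status value from the fault log above.
    for name, value in decode_fault_status(0x00000B32).items():
        print(f"{name}: {value:#x}")
```

Under that layout, 0x00000B32 gives WALKER_ERROR 0x1, PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x1 and client ID 0x5, which matches the printed fields.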

Shortly after, MES stops responding:

```
amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
amdgpu: failed to reg_write_reg_wait
```

The driver then attempts recovery/reset.

---

Reset / recovery behavior:

On gfx1150, the recovery attempt itself fails:

* VPE queue reset fails
* Driver falls back to MODE2 reset
* SMU resumes successfully
* MES fails to respond when re-adding queues
* gfx_v11_0 resume fails with -110 (ETIMEDOUT)

Example reset log excerpt (also reproducible on 6.17.x):

```
amdgpu: GPU reset begin!
amdgpu: VPE queue reset failed
amdgpu: MODE2 reset
amdgpu: SMU is resumed successfully
amdgpu: MES failed to respond to msg=ADD_QUEUE
amdgpu: resume of IP block <gfx_v11_0> failed -110
amdgpu: GPU reset end with ret = -110
```
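
When comparing logs across kernels, the outcome of a reset attempt can be pulled out mechanically. This is a minimal Python sketch, assuming the message strings match the excerpts in this report exactly. `summarize_reset` is a hypothetical helper, and the errno table contains only the one Linux value seen here.

```python
import re

# Only the errno that appears in this report; 110 is ETIMEDOUT on Linux.
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Patterns assumed from the log excerpts in this report.
RESET_END = re.compile(r"GPU reset end with ret = (-?\d+)")
MES_TIMEOUT = re.compile(r"MES failed to respond to msg=(\S+)")


def summarize_reset(log_text: str) -> dict:
    """Collect MES timeouts and the final reset result from a dmesg excerpt."""
    summary = {"mes_timeouts": [], "ret": None, "errno": None}
    for line in log_text.splitlines():
        mes = MES_TIMEOUT.search(line)
        if mes:
            summary["mes_timeouts"].append(mes.group(1))
        end = RESET_END.search(line)
        if end:
            ret = int(end.group(1))
            summary["ret"] = ret
            # Kernel return codes are negative errno values.
            summary["errno"] = ERRNO_NAMES.get(-ret)
    return summary


if __name__ == "__main__":
    excerpt = """\
amdgpu: GPU reset begin!
amdgpu: MES failed to respond to msg=ADD_QUEUE
amdgpu: GPU reset end with ret = -110
"""
    print(summarize_reset(excerpt))
```

A summary with a non-zero `ret` and `ADD_QUEUE` in `mes_timeouts` corresponds to the failure described here.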

In practice this leaves the system unusable and often requires a power cycle.

---

Additional notes:

* The reset failure is reproducible on an otherwise idle system using
`amd-smi reset --gpureset`.
* The same reset failure occurs on 6.17.10, so reset/recovery for gfx1150 
appears incomplete independent of the 6.19 regression.
* 6.19-rc2 increases the frequency of hitting recovery due to the CPC/gfxhub 
illegal access fault.
* This report focuses on the *trigger* (illegal access / page fault), not the 
reset issue itself.

---

Summary:

* The gfxhub CPC page fault at VA 0x0 appears to be a real bug in 6.19-rc2.
* Any recovery attempt on gfx1150 currently escalates into an unrecoverable 
state.
* Avoiding recovery (e.g. by disabling CWSR) avoids crashes but masks the 
underlying fault.

Please let me know if additional traces, bisect testing, or instrumentation 
would be helpful.

Thank you for your time.

Best regards,
Harris Landgarten
516 643-1286