Hello, I have a system running Debian unstable with an AMD RX-570. It has been working fine for a while, but recently, anything that uses the more advanced features of the GPU causes the system to lock up hard: black screen, no response to the keyboard, no network connectivity.

I'm not sure exactly which functionality of the GPU causes this: ordinary web browsing, development work, etc. never cause problems, but games, the Unigine Heaven benchmark, and even glmark2 invariably do, sometimes immediately, sometimes after a few seconds or minutes. I'm not sure exactly when this began: I hadn't been using the system for any of the problematic tasks for a while.

I've tried looking in the logs. Running 'journalctl -b -1' after a lockup generally shows nothing. I've tried to catch the error with 'tail -F /var/log/syslog', and most of the time I see nothing (just the hang, with no warning in the log), but once I caught this:

2023-08-01T10:41:23.531381-04:00 lucy kernel: [38532.241396] gmc_v8_0_process_interrupt: 15 callbacks suppressed
2023-08-01T10:41:23.531394-04:00 lucy kernel: [38532.241401] amdgpu 0000:02:00.0: amdgpu: GPU fault detected: 147 0x06508401 for process heaven_x64 pid 14771 thread heaven_x64:cs0 pid 14792
2023-08-01T10:41:23.531395-04:00 lucy kernel: [38532.241407] amdgpu 0000:02:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x060004CA
2023-08-01T10:41:23.531396-04:00 lucy kernel: [38532.241408] amdgpu 0000:02:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02084001
2023-08-01T10:41:23.531396-04:00 lucy kernel: [38532.241409] amdgpu 0000:02:00.0: amdgpu: VM fault (0x01, vmid 1, pasid 32778) at page 100664522, read from 'TC7' (0x54433700) (132)
2023-08-01T10:41:23.531397-04:00 lucy kernel: [38532.241429] DMAR: DRHD: handling fault status reg 2
2023-08-01T10:41:23.531398-04:00 lucy kernel: [38532.241433] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbff5000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531399-04:00 lucy kernel: [38532.241438] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbfd1000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531400-04:00 lucy kernel: [38532.241442] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbffd000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531409-04:00 lucy kernel: [38532.241445] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfffe0000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531409-04:00 lucy kernel: [38532.241449] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbffd000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531410-04:00 lucy kernel: [38532.241453] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xffff8000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531411-04:00 lucy kernel: [38532.241456] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfffd8000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531412-04:00 lucy kernel: [38532.241460] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbbc1000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531412-04:00 lucy kernel: [38532.241460] pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: 0000:00:02.0
2023-08-01T10:41:23.531413-04:00 lucy kernel: [38532.241464] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xeffc0000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531414-04:00 lucy kernel: [38532.241477] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
2023-08-01T10:41:23.531415-04:00 lucy kernel: [38532.241482] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00040000/00000000
2023-08-01T10:41:23.531415-04:00 lucy kernel: [38532.241487] pcieport 0000:00:02.0: [18] MalfTLP (First)
2023-08-01T10:41:23.531416-04:00 lucy kernel: [38532.241492] pcieport 0000:00:02.0: AER: TLP Header: 00001000 020024ff aaa800c0 00000000
2023-08-01T10:41:23.531417-04:00 lucy kernel: [38532.241500] [drm] PCI error: detected callback, state(2)!!
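
As a sanity check on the "error status/mask=00040000/00000000" line in the log, the status word can be decoded by hand. Below is a minimal Python sketch, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register; the function name and the table of descriptions are my own labels for illustration, not kernel identifiers.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register
# (assumed standard layout; descriptions are informal labels).
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_aer_status(status, mask=0):
    """Return (bit, description) pairs for status bits that are set and not masked."""
    active = status & ~mask  # assumption: a set mask bit means "do not report"
    return [
        (bit, AER_UNCORRECTABLE_BITS.get(bit, "unknown/reserved"))
        for bit in range(32)
        if active & (1 << bit)
    ]


if __name__ == "__main__":
    # Values copied from the "error status/mask=00040000/00000000" log line.
    for bit, name in decode_aer_status(0x00040000, 0x00000000):
        print(f"[{bit}] {name}")
```

With the values from the log, only bit 18 is set, which agrees with the "[18] MalfTLP" line that the kernel printed right afterwards.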

I've found similar reports online, e.g.:

https://unix.stackexchange.com/questions/327730/what-causes-this-pcieport-00000003-0-pcie-bus-error-aer-bad-tlp
https://forums.linuxmint.com/viewtopic.php?t=380748
https://gitlab.freedesktop.org/drm/amd/-/issues/2358

But I'm really not clear whether these represent the same problem, or are just different variations of a more general driver / firmware problem. (I'm assuming it's software / firmware, since everything worked fine previously, although I suppose it's possible that something physical has broken in the hardware.)

Any ideas?

--
Celejar

