On Mon, 25 Aug 2025 at 18:41, Mario Limonciello <supe...@kernel.org> wrote:
>
> On 8/25/2025 9:01 AM, Antheas Kapenekakis wrote:
> > On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <l...@antheas.dev> wrote:
> >>
> >> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeuc...@gmail.com> wrote:
> >>>
> >>> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <l...@antheas.dev> wrote:
> >>>>
> >>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> >>>> suspend resumes result in a soft lock around 1 second after the screen
> >>>> turns on (it freezes). This happens due to power gating VPE when it is
> >>>> not used, which happens 1 second after inactivity.
> >>>>
> >>>> Specifically, the VPE gating after resume is as follows: an initial
> >>>> ungate, followed by a gate in the resume process. Then,
> >>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> >>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> >>>> causes an ungate, After that test, vpe_idle_work_handler is scheduled
> >>>> with VPE_IDLE_TIMEOUT (1s).
> >>>>
> >>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> >>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
> >>>> that called the command being stuck processing it.
> >>>>
> >>>> Specifically, after that SMU command tries to run, we get the following:
> >>>>
> >>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> >>>> ...
> >>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> >>>> ...
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> >>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> >>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> >>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> >>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> >>>>
> >>>> In addition to e.g., kwin errors in journalctl. 0000:c4.00.0 is the GPU.
> >>>> Interestingly, 0000:c4.00.6, which is another HDA block, 0000:c4.00.5,
> >>>> a PCI controller, and 0000:c4.00.2, resume normally. 0x00000032 is the
> >>>> PowerDownVpe(50) command which is the common failure point in all
> >>>> failed resumes.
> >>>>
> >>>> On a normal resume, we should get the following power gates:
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> >>>>
> >>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> >>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> >>>> time of 12s sleep, 8s resume. The suspected reason here is that 1s that
> >>>> when VPE is used, it needs a bit of time before it can be gated and
> >>>> there was a borderline delay before, which is not enough for Strix Halo.
> >>>> When the VPE is not used, such as on resume, gating it instantly does
> >>>> not seem to cause issues.
> >>>
> >>> This doesn't make much sense. The VPE idle timeout is arbitrary. The
> >>> VPE idle work handler checks to see if the block is idle before it
> >>> powers gates the block. If it's not idle, then the delayed work is
> >>> rescheduled so changing the timing should not make a difference. We
> >>> are no powering down VPE while it still has active jobs.
> >>> It sounds like there is some race condition somewhere else.
> >>
> >> On resume, the vpe is ungated and gated instantly, which does not
> >> cause any crashes, then the delayed work is scheduled to run two
> >> seconds later. Then, the tests run and finish, which start the gate
> >> timer. After the timer lapses and the kernel tries to gate VPE, it
> >> crashes. I logged all SMU commands and there is no difference between
> >> the ones in a crash and not, other than the fact the VPE gate command
> >> failed. Which becomes apparent when the next command runs. I will also
> >> note that until the idle timer lapses, the system is responsive
> >>
> >> Since the VPE is ungated to run the tests, I assume that in my setup
> >> it is not used close to resume.
> >
> > I should also add that I forced a kernel panic and dumped all CPU
> > backtraces in multiple logs. After the softlock, CPUs were either
> > parked in the scheduler, powered off, or stuck executing an SMU
> > command by e.g., a userspace usage sensor graph. So it is not a
> > deadlock.
> >
>
> Can you please confirm if you are on the absolute latest linux-firmware
> when you reproduced this issue?

I was on the latest at the time, built from source. I think it was
commit 08ee93ff8ffa. There was an update today though, it seems.

> Can you please share the debugfs output for amdgpu_firmware_info.

Here is the information from it:

VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x0000001f
PFP feature version: 35, firmware version: 0x0000002c
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x11530505
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x11530505
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x0000001f
IMU feature version: 0, firmware version: 0x0b352300
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648366, firmware version: 0x210000ee
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000044
TA DTM feature version: 0x00000000, firmware version: 0x12000018
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00647000 (100.112.0)
SDMA0 feature version: 60, firmware version: 0x0000000e
VCN feature version: 0, firmware version: 0x0911800b
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x09002600
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000007c
VPE feature version: 60, firmware version: 0x00000016
VBIOS version: 113-STRXLGEN-001

I see there was an update today though.

Antheas
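
P.S. For anyone comparing SMU firmware versions against this dump: the decoded SMC version in parentheses (100.112.0) looks like the raw word 0x00647000 split into bytes. Below is a minimal Python sketch of that decoding. The program/major/minor/debug byte layout is an assumption inferred only from the single SMC line above, and the helper name decode_smc_version is made up for illustration, so treat it as a convenience for reading dumps rather than a statement of how the driver does it.

```python
def decode_smc_version(raw: int) -> tuple[int, int, int, int]:
    """Split a 32-bit SMC firmware word into (program, major, minor, debug).

    Assumed layout (inferred from the dump, not taken from driver source):
    bits 31-24 program, 23-16 major, 15-8 minor, 7-0 debug.
    """
    program = (raw >> 24) & 0xFF
    major = (raw >> 16) & 0xFF
    minor = (raw >> 8) & 0xFF
    debug = raw & 0xFF
    return program, major, minor, debug


# Value copied from the "SMC feature version" line in the dump.
program, major, minor, debug = decode_smc_version(0x00647000)
print(f"program: {program}, version: {major}.{minor}.{debug}")
```

For the value in this dump, the sketch gives program 0 and version 100.112.0, matching what debugfs prints next to the raw word.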