On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <l...@antheas.dev> wrote:
>
> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeuc...@gmail.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <l...@antheas.dev>
> > wrote:
> > >
> > > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> > > suspend resumes result in a soft lock around 1 second after the screen
> > > turns on (it freezes). This happens due to power gating VPE when it is
> > > not used, which happens 1 second after inactivity.
> > >
> > > Specifically, the VPE gating after resume is as follows: an initial
> > > ungate, followed by a gate in the resume process. Then,
> > > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> > > to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> > > causes an ungate. After that test, vpe_idle_work_handler is scheduled
> > > with VPE_IDLE_TIMEOUT (1s).
> > >
> > > When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> > > SMU to hang and partially freezes half of the GPU IPs, with the thread
> > > that called the command being stuck processing it.
> > >
> > > Specifically, after that SMU command tries to run, we get the following:
> > >
> > > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> > > ...
> > > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> > > ...
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous
> > > command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> > > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed,
> > > ret = -62.
> > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous
> > > command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> > > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg
> > > failed, ret = -62.
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous
> > > command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> > > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> > > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous
> > > command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> > >
> > > This is in addition to, e.g., kwin errors in journalctl. 0000:c4:00.0 is
> > > the GPU. Interestingly, 0000:c4:00.6, which is another HDA block,
> > > 0000:c4:00.5, a PCI controller, and 0000:c4:00.2, resume normally.
> > > 0x00000032 is the PowerDownVpe(50) command, which is the common failure
> > > point in all failed resumes.
> > >
> > > On a normal resume, we should get the following power gates:
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param:
> > > 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param:
> > > 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param:
> > > 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param:
> > > 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param:
> > > 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param:
> > > 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param:
> > > 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param:
> > > 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param:
> > > 0x00010000, resp: 0x00000001
> > >
> > > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> > > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> > > time of 12s sleep, 8s resume. The suspected reason is that when VPE is
> > > used, it needs a bit of time before it can be gated, and the previous
> > > 1s delay was borderline, which is not enough for Strix Halo.
> > > When the VPE is not used, such as on resume, gating it instantly does
> > > not seem to cause issues.
> >
> > This doesn't make much sense. The VPE idle timeout is arbitrary. The
> > VPE idle work handler checks to see if the block is idle before it
> > power gates the block. If it's not idle, then the delayed work is
> > rescheduled, so changing the timing should not make a difference. We
> > are not powering down VPE while it still has active jobs. It sounds
> > like there is some race condition somewhere else.
>
> On resume, the VPE is ungated and gated instantly, which does not
> cause any crashes, then the delayed work is scheduled to run two
> seconds later. Then, the tests run and finish, which start the gate
> timer. After the timer lapses and the kernel tries to gate VPE, it
> crashes. I logged all SMU commands and there is no difference between
> the ones in a crash and not, other than the fact the VPE gate command
> failed, which becomes apparent when the next command runs. I will also
> note that until the idle timer lapses, the system is responsive.
>
> Since the VPE is ungated to run the tests, I assume that in my setup
> it is not used close to resume.

I should also add that I forced a kernel panic and dumped all CPU
backtraces in multiple logs. After the soft lock, CPUs were either
parked in the scheduler, powered off, or stuck executing an SMU command
issued by, e.g., a userspace usage sensor graph. So it is not a deadlock.

Antheas

> Antheas
>
> > Alex
> >
> > >
> > > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> > > Signed-off-by: Antheas Kapenekakis <l...@antheas.dev>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > index 121ee17b522b..24f09e457352 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > @@ -34,8 +34,8 @@
> > >  /* VPE CSA resides in the 4th page of CSA */
> > >  #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> > >
> > > -/* 1 second timeout */
> > > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> > > +/* 2 second timeout */
> > > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> > >
> > >  #define VPE_MAX_DPM_LEVEL 4
> > >  #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> > >
> > > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> > > --
> > > 2.50.1
> > >
> > >
> >