[AMD Official Use Only]

I checked the log pasted below with Curry. Adding this fix in vcn_v1_0_stop is not workable, because it introduces a circular call chain (below) that leads to a deadlock:

VCN ring begin use -> amdgpu_dpm_enable_uvd -> acquire the smu_lock -> smu10_powergate_vcn -> amdgpu_device_ip_set_powergating_state -> vcn_v1_0_stop -> amdgpu_dpm_enable_uvd -> try to acquire the smu_lock again -> deadlock
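
For illustration only, the re-entrant locking pattern in the call chain above can be modelled in user space roughly as follows. This is a single-threaded sketch, not driver code: every model_* name, the model_lock type, and the stop_calls_dpm flag are hypothetical stand-ins, and the model lock returns -EDEADLK on re-entry where a real kernel mutex would simply block forever.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a non-recursive lock. A real kernel
 * mutex would hang on re-entry; this model reports the re-entry instead. */
struct model_lock {
	bool held;
};

static int model_lock_acquire(struct model_lock *l)
{
	if (l->held)
		return -EDEADLK;	/* same context already owns the lock */
	l->held = true;
	return 0;
}

static void model_lock_release(struct model_lock *l)
{
	l->held = false;
}

/* Stand-in for the smu_lock named in the call chain. */
static struct model_lock smu_lock;

static int model_dpm_enable_uvd(bool enable, bool stop_calls_dpm);

/* Models vcn_v1_0_stop(); stop_calls_dpm selects the proposed change that
 * calls back into the DPM layer from inside the stop path. */
static int model_vcn_stop(bool stop_calls_dpm)
{
	if (stop_calls_dpm)
		return model_dpm_enable_uvd(false, stop_calls_dpm);
	return 0;
}

/* Models smu10_powergate_vcn -> amdgpu_device_ip_set_powergating_state. */
static int model_powergate_vcn(bool stop_calls_dpm)
{
	return model_vcn_stop(stop_calls_dpm);
}

/* Models amdgpu_dpm_enable_uvd(): takes the lock, then gates VCN when
 * enable is false. The ungate path is left out of this model. */
static int model_dpm_enable_uvd(bool enable, bool stop_calls_dpm)
{
	int r = model_lock_acquire(&smu_lock);

	if (r)
		return r;
	r = enable ? 0 : model_powergate_vcn(stop_calls_dpm);
	model_lock_release(&smu_lock);
	return r;
}
```

Calling model_dpm_enable_uvd(false, true) walks the circular path and hits the re-entry on the second acquire, while model_dpm_enable_uvd(false, false) (stop path without the extra call) completes normally.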

BR
Evan

From: Gong, Curry <[email protected]>
Sent: Monday, December 13, 2021 4:56 PM
To: Zhu, James <[email protected]>; [email protected]
Cc: Liu, Leo <[email protected]>; Quan, Evan <[email protected]>; Deucher, Alexander <[email protected]>
Subject: RE: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled

[AMD Official Use Only]

Hi James:

With the following patch, an error will be reported when the driver is loaded

+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
@@ -1202,6 +1204,9 @@ static int vcn_v1_0_stop(struct amdgpu_device *adev)
 	else
 		r = vcn_v1_0_stop_spg_mode(adev);
 
+	if (adev->pm.dpm_enabled)
+		amdgpu_dpm_enable_uvd(adev, false);
+
 	return r;
 }

$ dmesg
[ 363.181081] INFO: task kworker/3:2:223 blocked for more than 120 seconds.
[ 363.181150] Tainted: G OE 5.11.0-41-generic #45~20.04.1-Ubuntu
[ 363.181208] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.181266] task:kworker/3:2 state:D stack: 0 pid: 223 ppid: 2 flags:0x00004000
[ 363.181276] Workqueue: events vcn_v1_0_idle_work_handler [amdgpu]
[ 363.181612] Call Trace:
[ 363.181618] __schedule+0x44c/0x8a0
[ 363.181627] schedule+0x4f/0xc0
[ 363.181631] schedule_preempt_disabled+0xe/0x10
[ 363.181636] __mutex_lock.isra.0+0x183/0x4d0
[ 363.181643] __mutex_lock_slowpath+0x13/0x20
[ 363.181648] mutex_lock+0x32/0x40
[ 363.181652] amdgpu_dpm_set_powergating_by_smu+0x9c/0x180 [amdgpu]
[ 363.182055] amdgpu_dpm_enable_uvd+0x38/0x110 [amdgpu]
[ 363.182454] vcn_v1_0_set_powergating_state+0x2e7e/0x3cf0 [amdgpu]
[ 363.182776] amdgpu_device_ip_set_powergating_state+0x6c/0xc0 [amdgpu]
[ 363.183028] smu10_powergate_vcn+0x2a/0x80 [amdgpu]
[ 363.183361] pp_set_powergating_by_smu+0xc5/0x2b0 [amdgpu]
[ 363.183699] amdgpu_dpm_set_powergating_by_smu+0xb6/0x180 [amdgpu]
[ 363.184040] amdgpu_dpm_enable_uvd+0x38/0x110 [amdgpu]
[ 363.184391] vcn_v1_0_idle_work_handler+0xe1/0x130 [amdgpu]
[ 363.184667] process_one_work+0x220/0x3c0
[ 363.184674] worker_thread+0x4d/0x3f0
[ 363.184677] ? process_one_work+0x3c0/0x3c0
[ 363.184680] kthread+0x12b/0x150
[ 363.184685] ? set_kthread_struct+0x40/0x40
[ 363.184690] ret_from_fork+0x22/0x30
[ 363.184699] INFO: task kworker/2:2:233 blocked for more than 120 seconds.
[ 363.184739] Tainted: G OE 5.11.0-41-generic #45~20.04.1-Ubuntu
[ 363.184782] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.184825] task:kworker/2:2 state:D stack: 0 pid: 233 ppid: 2 flags:0x00004000
[ 363.184831] Workqueue: events amdgpu_device_delayed_init_work_handler [amdgpu]
[ 363.185085] Call Trace:
[ 363.185087] __schedule+0x44c/0x8a0
[ 363.185092] schedule+0x4f/0xc0
[ 363.185095] schedule_timeout+0x202/0x290
[ 363.185099] ? sched_clock_cpu+0x11/0xb0
[ 363.185105] wait_for_completion+0x94/0x100
[ 363.185110] __flush_work+0x12a/0x1e0
[ 363.185113] ? worker_detach_from_pool+0xc0/0xc0
[ 363.185119] __cancel_work_timer+0x10e/0x190
[ 363.185123] cancel_delayed_work_sync+0x13/0x20
[ 363.185126] vcn_v1_0_ring_begin_use+0x20/0x70 [amdgpu]
[ 363.185401] amdgpu_ring_alloc+0x48/0x60 [amdgpu]
[ 363.185640] amdgpu_ib_schedule+0x493/0x600 [amdgpu]
[ 363.185884] amdgpu_job_submit_direct+0x3c/0xd0 [amdgpu]
[ 363.186186] amdgpu_vcn_dec_send_msg+0x105/0x210 [amdgpu]
[ 363.186460] amdgpu_vcn_dec_ring_test_ib+0x69/0x110 [amdgpu]
[ 363.186734] amdgpu_ib_ring_tests+0xf5/0x160 [amdgpu]
[ 363.186978] amdgpu_device_delayed_init_work_handler+0x15/0x30 [amdgpu]
[ 363.187206] process_one_work+0x220/0x3c0
[ 363.187210] worker_thread+0x4d/0x3f0
[ 363.187214] ? process_one_work+0x3c0/0x3c0
[ 363.187217] kthread+0x12b/0x150
[ 363.187221] ? set_kthread_struct+0x40/0x40
[ 363.187226] ret_from_fork+0x22/0x30

BR
Curry Gong

From: Zhu, James <[email protected]>
Sent: Saturday, December 11, 2021 5:07 AM
To: Gong, Curry <[email protected]>; [email protected]
Cc: Liu, Leo <[email protected]>; Zhu, James <[email protected]>; Quan, Evan <[email protected]>; Deucher, Alexander <[email protected]>
Subject: Re: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled

On 2021-12-10 6:41 a.m., chen gong wrote:

Play a video on the raven (or PCO, raven2) platform, and then do the S3 test. When resume, the following error will be reported:

amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec test failed (-110)
[drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vcn_v1_0> failed -110
amdgpu 0000:02:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -110

[why]
When playing the video: The power state flag of the vcn block is set to POWER_STATE_ON.

When doing suspend: There is no change to the power state flag of the vcn block, it is still POWER_STATE_ON.

When doing resume: Need to open the power gate of the vcn block and set the power state flag of the VCN block to POWER_STATE_ON. But at this time, the power state flag of the vcn block is already POWER_STATE_ON. The power status flag check in the "8f2cdef drm/amd/pm: avoid duplicate powergate/ungate setting" patch will return the amdgpu_dpm_set_powergating_by_smu function directly. As a result, the gate of the power was not opened, causing the subsequent ring test to fail.

[how]
In the suspend function of the vcn block, explicitly change the power state flag of the vcn block to POWER_STATE_OFF.

Signed-off-by: chen gong <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
index d54d720..d73676b 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
@@ -246,6 +246,13 @@ static int vcn_v1_0_suspend(void *handle)
 {
 	int r;
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	bool cancel_success;
+
+	cancel_success = cancel_delayed_work_sync(&adev->vcn.idle_work);

[JZ] Can you refer to vcn_v3_0_stop, and add amdgpu_dpm_enable_uvd(adev, false); to the end of vcn_v1_0_stop? See if it also can help.

+	if (cancel_success) {
+		if (adev->pm.dpm_enabled)
+			amdgpu_dpm_enable_uvd(adev, false);
+	}
 
 	r = vcn_v1_0_hw_fini(adev);
 	if (r)
