On 3/12/26 10:16, Sunil Khatri wrote: > cancel_delayed_work_sync for work hand_detect_work should not be > locked since the amdgpu_userq_hang_detect_work also need the same > mutex and when they run together it could be a deadlock. > > we do not need to hold the mutex for > cancel_delayed_work_sync(&queue->hang_detect_work). With this in place > if cancel and worker thread run at same time they will not deadlock. > > Due to any failures if there is a hand detect and reset that there a > deadlock scenarios between cancel and running the main thread. > > [ 243.118276] task:kworker/9:0 state:D stack:0 pid:73 tgid:73 ppid:2 > task_flags:0x4208060 flags:0x00080000 > [ 243.118283] Workqueue: events amdgpu_userq_hang_detect_work [amdgpu] > [ 243.118636] Call Trace: > [ 243.118639] <TASK> > [ 243.118644] __schedule+0x581/0x1810 > [ 243.118649] ? srso_return_thunk+0x5/0x5f > [ 243.118656] ? srso_return_thunk+0x5/0x5f > [ 243.118659] ? wake_up_process+0x15/0x20 > [ 243.118665] schedule+0x64/0xe0 > [ 243.118668] schedule_preempt_disabled+0x15/0x30 > [ 243.118671] __mutex_lock+0x346/0x950 > [ 243.118677] __mutex_lock_slowpath+0x13/0x20 > [ 243.118681] mutex_lock+0x2c/0x40 > [ 243.118684] amdgpu_userq_hang_detect_work+0x63/0x90 [amdgpu] > [ 243.118888] process_scheduled_works+0x1f0/0x450 > [ 243.118894] worker_thread+0x27f/0x370 > [ 243.118899] kthread+0x1ed/0x210 > [ 243.118903] ? __pfx_worker_thread+0x10/0x10 > [ 243.118906] ? srso_return_thunk+0x5/0x5f > [ 243.118909] ? __pfx_kthread+0x10/0x10 > [ 243.118913] ret_from_fork+0x10f/0x1b0 > [ 243.118916] ? __pfx_kthread+0x10/0x10 > [ 243.118920] ret_from_fork_asm+0x1a/0x30
Good catch, but userq destruction is completely broken in quite a number of ways. Have you taken a look at my patch "drm/amdgpu: fix eviction fence and userq manager shutdown"? How does this here interacts with that? Thanks, Christian. > > Signed-off-by: Sunil Khatri <[email protected]> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c > index 32541f1bde6d..c5875e175918 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c > @@ -621,15 +621,22 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, > struct amdgpu_usermode_que > { > struct amdgpu_device *adev = uq_mgr->adev; > int r = 0; > + bool hang_detect_fence = false; > > cancel_delayed_work_sync(&uq_mgr->resume_work); > mutex_lock(&uq_mgr->userq_mutex); > amdgpu_userq_wait_for_last_fence(queue); > /* Cancel any pending hang detection work and cleanup */ > if (queue->hang_detect_fence) { > - cancel_delayed_work_sync(&queue->hang_detect_work); > + hang_detect_fence = true; > queue->hang_detect_fence = NULL; > } > + mutex_unlock(&uq_mgr->userq_mutex); > + > + if (hang_detect_fence) > + cancel_delayed_work_sync(&queue->hang_detect_work); > + > + mutex_lock(&uq_mgr->userq_mutex); > r = amdgpu_bo_reserve(queue->db_obj.obj, true); > if (!r) { > amdgpu_bo_unpin(queue->db_obj.obj);
