On 5/12/26 09:04, Sunil Khatri wrote:
> CPU0: hang_detect_work → directly calls reset_work()
> CPU1: evict_all → queues reset_work (via workqueue)
>
> There is a possibility of two reset thread running at same time.
> To avoid that we add a per queue manager flag to avoid duplication.

Clear NAK, that doesn't make sense.

All reset work must run on a single threaded reset queue, so only one work at a time can run.

If multiple reset sources trigger at the same time (which is quite common) then the ones handled by a reset are canceled as soon as the reset is completed.

Regards,
Christian.

>
> Signed-off-by: Sunil Khatri <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 16 ++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h |  1 +
>  2 files changed, 17 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> index 0a1fc45f5b4e..1440f51b667f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> @@ -109,6 +109,19 @@ static void amdgpu_userq_mgr_reset_work(struct work_struct *work)
>  	if (!amdgpu_gpu_recovery)
>  		return;
>
> +	/*
> +	 * Prevent concurrent/duplicate reset executions. Both hang_detect_work
> +	 * (direct call) and evict_all (via schedule+flush_work) can invoke this
> +	 * function simultaneously. Use an atomic test-and-set so only the first
> +	 * caller proceeds; the second exits early.
> +	 *
> +	 * Note: amdgpu_in_reset() cannot be used here because in_gpu_reset is
> +	 * only set deep inside amdgpu_device_gpu_recover(), well after we've
> +	 * already entered this function.
> +	 */
> +	if (atomic_cmpxchg(&uq_mgr->reset_in_progress, 0, 1) != 0)
> +		return;
> +
>  	/*
>  	 * Iterate through all queue types to detect and reset problematic queues
>  	 * Process each queue type in the defined order
> @@ -145,6 +158,8 @@ static void amdgpu_userq_mgr_reset_work(struct work_struct *work)
>
>  		amdgpu_device_gpu_recover(adev, NULL, &reset_context);
>  	}
> +
> +	atomic_set(&uq_mgr->reset_in_progress, 0);
>  }
>
>  static void amdgpu_userq_hang_detect_work(struct work_struct *work)
> @@ -1304,6 +1319,7 @@ int amdgpu_userq_mgr_init(struct amdgpu_userq_mgr *userq_mgr, struct drm_file *f
>
>  	INIT_DELAYED_WORK(&userq_mgr->resume_work, amdgpu_userq_restore_worker);
>  	INIT_WORK(&userq_mgr->reset_work, amdgpu_userq_mgr_reset_work);
> +	atomic_set(&userq_mgr->reset_in_progress, 0);
>  	return 0;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h
> index 49b33e2d6932..2748ecc0f6c9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.h
> @@ -129,6 +129,7 @@ struct amdgpu_userq_mgr {
>  	 * Reset work which is used when eviction fails.
>  	 */
>  	struct work_struct reset_work;
> +	atomic_t reset_in_progress;
>  	atomic_t userq_count[AMDGPU_RING_TYPE_MAX];
>  };
>
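
For illustration, the serialization argument in the reply can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: it is not amdgpu or kernel workqueue code, every name in it (model_queue_work, model_cancel_work, model_drain, model_two_sources) is hypothetical, and it assumes each reset source has its own work item on one single-threaded queue. It shows that a work item which is already pending is not queued a second time, and that once one reset has run, the pending items it covered are canceled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 8

/* Hypothetical stand-in for a work item; not the kernel's work_struct. */
struct work {
	bool pending;	/* queued but not yet run */
	int runs;	/* how many times the handler ran */
};

/* Hypothetical single-threaded queue: a plain FIFO with one consumer. */
struct ordered_queue {
	struct work *fifo[MAX_WORK];
	size_t head, tail;
};

/* Queue a work item; returns false if it is already pending. */
static bool model_queue_work(struct ordered_queue *q, struct work *w)
{
	if (w->pending)
		return false;
	assert(q->tail < MAX_WORK);
	w->pending = true;
	q->fifo[q->tail++] = w;
	return true;
}

/* Cancel a pending item so the worker skips it. */
static bool model_cancel_work(struct work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/*
 * The single worker: items run strictly one after another, so two
 * handlers can never overlap. Assumption of this model: one reset
 * covers every source queued so far, so those items are canceled.
 */
static int model_drain(struct ordered_queue *q)
{
	int resets = 0;

	while (q->head < q->tail) {
		struct work *w = q->fifo[q->head++];

		if (!w->pending)
			continue;	/* canceled by an earlier reset */
		w->pending = false;
		w->runs++;
		resets++;
		for (size_t i = q->head; i < q->tail; i++)
			model_cancel_work(q->fifo[i]);
	}
	return resets;
}

/* Two reset sources trigger together; returns how many resets ran. */
int model_two_sources(void)
{
	struct ordered_queue q = {0};
	struct work hang_detect = {0}, evict_all = {0};

	model_queue_work(&q, &hang_detect);
	model_queue_work(&q, &evict_all);
	model_queue_work(&q, &evict_all);	/* duplicate, rejected */
	return model_drain(&q);
}
```

In this model model_two_sources() returns 1: both sources are queued, the first one performs the reset, and the second is canceled before it runs, without any extra in-progress flag.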
