On 5/19/26 15:05, Mikhail Gavrilov wrote:
> On Wed, Apr 29, 2026 at 7:37 PM Mikhail Gavrilov
> <[email protected]> wrote:
>>
>> When dumping IB contents from a hung job, amdgpu_devcoredump_format()
>> acquires the VM root PD's reservation lock via amdgpu_vm_lock_by_pasid()
>> and then, for each IB referenced by the job, calls amdgpu_bo_reserve()
>> on the BO that backs the IB.  Both reservations are taken on
>> reservation_ww_class_mutex objects but neither uses a ww_acquire_ctx,
>> which trips lockdep:
>>
>>   WARNING: possible recursive locking detected
>>   --------------------------------------------
>>   kworker/u128:0 is trying to acquire lock:
>>   ffff88838b16e1f0 (reservation_ww_class_mutex){+.+.}-{4:4},
>>     at: amdgpu_devcoredump_format+0x1594/0x23f0 [amdgpu]
>>
>>   but task is already holding lock:
>>   ffff8882f82681f0 (reservation_ww_class_mutex){+.+.}-{4:4},
>>     at: amdgpu_devcoredump_format+0x1594/0x23f0 [amdgpu]
>>
>>    Possible unsafe locking scenario:
>>          CPU0
>>          ----
>>     lock(reservation_ww_class_mutex);
>>     lock(reservation_ww_class_mutex);
>>
>>    *** DEADLOCK ***
>>    May be due to missing lock nesting notation
>>
>>   Workqueue: events_unbound amdgpu_devcoredump_deferred_work [amdgpu]
>>   Call Trace:
>>    __ww_mutex_lock.constprop.0
>>    ww_mutex_lock
>>    amdgpu_bo_reserve
>>    amdgpu_devcoredump_format+0x1594 [amdgpu]
>>    amdgpu_devcoredump_deferred_work+0xea [amdgpu]
>>    process_one_work
>>    worker_thread
>>    kthread
>>
> 
> Friendly ping. Pierre-Eric, Christian, Alex — any thoughts on this fix?
> 
> Happy to spin a v2 with any review feedback. One thing I'm aware of:
> the `Cc: [email protected] # 7.1` tag is probably unnecessary
> since the regression only landed in 7.1-rc1 and the fix will reach 7.1
> final naturally via drm-fixes; I can drop it in v2 if preferred.
> 

Good catch, but the fix is complete overkill.

You can lock multiple BOs at the same time; something like this should do it:

        drm_exec_init(&exec, DRM_EXEC_IGNORE_DUPLICATES, 2);
        drm_exec_until_all_locked(&exec) {
                ret = amdgpu_vm_lock_pd(vm, &exec, 1);
                drm_exec_retry_on_contention(&exec);
                if (unlikely(ret))
                        goto fail_lock;

                mapping = amdgpu_vm_bo_lookup_mapping(vm, ib_addr >> PAGE_SHIFT);
                if (!mapping) {
                        ret = -EINVAL;
                        goto fail_lock;
                }

                obj = mapping->bo_va->base.bo;
                ret = drm_exec_lock_obj(&exec, &obj->tbo.base);
                drm_exec_retry_on_contention(&exec);
                if (unlikely(ret))
                        goto fail_lock;
        }
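
For anyone not familiar with the pattern behind drm_exec_until_all_locked()
and drm_exec_retry_on_contention(), here is a rough userspace analogy using
pthreads. It is only a sketch of the "take all locks, and on contention drop
everything and start over" idea, not the drm_exec implementation. lock_all()
and unlock_all() are hypothetical helper names, and the sketch assumes every
entry in the set is a distinct, non-recursive mutex (there is no equivalent of
DRM_EXEC_IGNORE_DUPLICATES here).

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Release every lock in the set; the caller must hold all of them. */
static void unlock_all(pthread_mutex_t **locks, size_t count)
{
        size_t i;

        for (i = 0; i < count; i++)
                pthread_mutex_unlock(locks[i]);
}

/*
 * Take every lock in the set, or none of them.
 * Assumption: all entries are distinct, non-recursive mutexes.
 * Returns 0 on success or a pthread error code.
 */
static int lock_all(pthread_mutex_t **locks, size_t count)
{
        size_t slow = count;    /* index taken by blocking, count if none */
        size_t i, j;
        int ret;

retry:
        for (i = 0; i < count; i++) {
                if (i == slow)
                        continue;       /* already held from the last round */

                ret = pthread_mutex_trylock(locks[i]);
                if (ret == 0)
                        continue;

                /* Back off: drop everything we hold so others can progress. */
                for (j = 0; j < i; j++)
                        pthread_mutex_unlock(locks[j]);
                if (slow != count && slow > i)
                        pthread_mutex_unlock(locks[slow]);

                if (ret != EBUSY)
                        return ret;

                /* Sleep on the contended lock, then restart the whole pass. */
                ret = pthread_mutex_lock(locks[i]);
                if (ret)
                        return ret;
                slow = i;
                goto retry;
        }

        return 0;
}
```

A caller would build an array of pointers to the locks it needs, call
lock_all(), do its work, and then call unlock_all() on the same array. Unlike
this sketch, the kernel's ww_mutex machinery also uses the acquire context to
decide which task has to back off, which is what makes the retry loop fair.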

@Pierre-Eric can you take a look at that as well?

Thanks in advance,
Christian.
