On 5/19/26 15:05, Mikhail Gavrilov wrote:
> On Wed, Apr 29, 2026 at 7:37 PM Mikhail Gavrilov
> <[email protected]> wrote:
>>
>> When dumping IB contents from a hung job, amdgpu_devcoredump_format()
>> acquires the VM root PD's reservation lock via amdgpu_vm_lock_by_pasid()
>> and then, for each IB referenced by the job, calls amdgpu_bo_reserve()
>> on the BO that backs the IB. Both reservations are taken on
>> reservation_ww_class_mutex objects but neither uses a ww_acquire_ctx,
>> which trips lockdep:
>>
>> WARNING: possible recursive locking detected
>> --------------------------------------------
>> kworker/u128:0 is trying to acquire lock:
>> ffff88838b16e1f0 (reservation_ww_class_mutex){+.+.}-{4:4},
>> at: amdgpu_devcoredump_format+0x1594/0x23f0 [amdgpu]
>>
>> but task is already holding lock:
>> ffff8882f82681f0 (reservation_ww_class_mutex){+.+.}-{4:4},
>> at: amdgpu_devcoredump_format+0x1594/0x23f0 [amdgpu]
>>
>> Possible unsafe locking scenario:
>> CPU0
>> ----
>> lock(reservation_ww_class_mutex);
>> lock(reservation_ww_class_mutex);
>>
>> *** DEADLOCK ***
>> May be due to missing lock nesting notation
>>
>> Workqueue: events_unbound amdgpu_devcoredump_deferred_work [amdgpu]
>> Call Trace:
>> __ww_mutex_lock.constprop.0
>> ww_mutex_lock
>> amdgpu_bo_reserve
>> amdgpu_devcoredump_format+0x1594 [amdgpu]
>> amdgpu_devcoredump_deferred_work+0xea [amdgpu]
>> process_one_work
>> worker_thread
>> kthread
>>
>
> Friendly ping. Pierre-Eric, Christian, Alex — any thoughts on this fix?
>
> Happy to spin a v2 with any review feedback. One thing I'm aware of:
> the `Cc: [email protected] # 7.1` tag is probably unnecessary
> since the regression only landed in 7.1-rc1 and the fix will reach 7.1
> final naturally via drm-fixes; I can drop it in v2 if preferred.
>
Good catch, but the fix is complete overkill.
You can lock multiple BOs at the same time with drm_exec; something like this
should do it:

	drm_exec_init(&exec, DRM_EXEC_IGNORE_DUPLICATES, 2);
	drm_exec_until_all_locked(&exec) {
		/* Lock the VM root PD first. */
		ret = amdgpu_vm_lock_pd(vm, &exec, 1);
		drm_exec_retry_on_contention(&exec);
		if (unlikely(ret))
			goto fail_lock;

		/* Look up the BO backing the IB and lock it in the same context. */
		mapping = amdgpu_vm_bo_lookup_mapping(vm, ib_addr >> PAGE_SHIFT);
		if (!mapping) {
			ret = -EINVAL;
			goto fail_lock;
		}

		obj = mapping->bo_va->base.bo;
		ret = drm_exec_lock_obj(&exec, &obj->tbo.base);
		drm_exec_retry_on_contention(&exec);
		if (unlikely(ret))
			goto fail_lock;
	}

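For anyone unfamiliar with the pattern, the "lock everything, or drop it all and retry" idea can be sketched in userspace roughly as below. This is only a simplified analogy and not how drm_exec is implemented: plain trylock with back-off stands in for the ww_mutex contention handling, and struct fake_bo, listed_earlier(), lock_all() and unlock_all() are made-up names for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for a BO with a reservation lock. */
struct fake_bo {
	pthread_mutex_t resv;
};

/* Return 1 if objs[i] already appears at a lower index (a duplicate). */
static int listed_earlier(struct fake_bo **objs, size_t i)
{
	for (size_t j = 0; j < i; j++)
		if (objs[j] == objs[i])
			return 1;
	return 0;
}

/* Unlock the first n entries, unlocking each distinct object once. */
static void unlock_all(struct fake_bo **objs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!listed_earlier(objs, i))
			pthread_mutex_unlock(&objs[i]->resv);
}

/*
 * Lock every distinct object in objs[], or none of them. On contention,
 * drop everything taken so far and start over. Duplicates are skipped,
 * loosely mirroring DRM_EXEC_IGNORE_DUPLICATES.
 * Returns the number of passes that were needed.
 */
static int lock_all(struct fake_bo **objs, size_t n)
{
	int passes = 0;

retry:
	passes++;
	for (size_t i = 0; i < n; i++) {
		assert(objs[i] != NULL);
		if (listed_earlier(objs, i))
			continue;
		if (pthread_mutex_trylock(&objs[i]->resv) != 0) {
			/* Contended: release what we hold, then retry. */
			unlock_all(objs, i);
			sched_yield();
			goto retry;
		}
	}
	return passes;
}
```

A caller would build an array such as { &pd, &ib_bo }, call lock_all() on it, do its work, and then call unlock_all() with the same array and count. In the kernel, drm_exec handles this bookkeeping itself, which is why the snippet above needs no explicit unlock-and-retry code.
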
@Pierre-Eric can you take a look at that as well?
Thanks in advance,
Christian.