On Fri, May 29, 2026 at 5:23 AM Tvrtko Ursulin
<[email protected]> wrote:
>
> A call chain at driver probe exists where the profiler lock is used before
> it is initialized:
>
> [   12.131440] kfd kfd: Allocated 3969056 bytes on gart
> [   12.131561] kfd kfd: Total number of KFD nodes to be created: 1
> [   12.132691] ------------[ cut here ]------------
> [   12.132703] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> [   12.132705] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x616/0x1150, CPU#0: (udev-worker)/569
> ...
> [   12.133051] Call Trace:
> [   12.133055]  <TASK>
> [   12.133059]  ? mark_held_locks+0x40/0x70
> [   12.133068]  ? init_mqd+0xe1/0x1b0 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.133671]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
> [   12.133683]  ? init_mqd+0xe1/0x1b0 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.134235]  init_mqd+0xe1/0x1b0 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.134781]  init_mqd_hiq+0x12/0x30 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.135340]  kq_initialize.constprop.0+0x309/0x400 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.135898]  kernel_queue_init+0x44/0x80 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.136439]  pm_init+0x70/0x100 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.136984]  start_cpsch+0x1dc/0x280 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.137525]  kgd2kfd_device_init+0x70f/0xd10 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.138070]  amdgpu_amdkfd_device_init+0x172/0x230 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
> [   12.138618]  amdgpu_device_init+0x246a/0x2960 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
>
> The human readable call chain is:
>
> kgd2kfd_device_init
>   kfd_init_node
>     kfd_resume
>       node->dqm->ops.start
>
> Where start can be start_cpsch, which calls pm_init, etc, which ends up
> calling kq->mqd_mgr->init_mqd, which takes the profiler lock:
>
> init_mqd()
> {
> ...
>         mutex_lock(&mm->dev->kfd->profiler_lock);
> ...
>
> Fix it by initializing the mutex at the top of kgd2kfd_device_init().
>
> Signed-off-by: Tvrtko Ursulin <[email protected]>
> Fixes: a789761de305 ("amd/amdkfd: Add kfd_ioctl_profiler to contain profiler kernel driver changes")
> Cc: Benjamin Welton <[email protected]>
> Cc: Perry Yuan <[email protected]>
> Cc: Kent Russell <[email protected]>
> Cc: Yifan Zhang <[email protected]>
> Cc: Alex Deucher <[email protected]>
> Cc: Felix Kuehling <[email protected]>

Applied.  Thanks!

Alex

> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index c2c59781feee..1c57e11220a7 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -736,6 +736,9 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>         int partition_mode;
>         int xcp_idx;
>
> +       kfd->profiler_process = NULL;
> +       mutex_init(&kfd->profiler_lock);
> +
>         kfd->mec_fw_version = amdgpu_amdkfd_get_fw_version(kfd->adev,
>                         KGD_ENGINE_MEC1);
>         kfd->mec2_fw_version = amdgpu_amdkfd_get_fw_version(kfd->adev,
> @@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>
>         svm_range_set_max_pages(kfd->adev);
>
> -       kfd->profiler_process = NULL;
> -       mutex_init(&kfd->profiler_lock);
> -
>         kfd->init_complete = true;
>         dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
>                  kfd->adev->pdev->device);
> --
> 2.54.0
>
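
For readers following along, the ordering problem in the quoted patch can
be sketched in userspace. This is only an illustration under stated
assumptions, not the driver code: struct demo_dev, demo_start() and
demo_device_init() are hypothetical names, a pthread mutex stands in for
the kernel mutex, and the lock_ready flag stands in for the lock->magic
check that triggered the DEBUG_LOCKS_WARN_ON above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kfd_dev; not the real driver type. */
struct demo_dev {
	pthread_mutex_t profiler_lock;
	bool lock_ready;	/* plays the role of the lock->magic debug check */
	void *profiler_process;
	bool init_complete;
};

/*
 * Stand-in for a start path that ends up taking the profiler lock
 * (init_mqd() in the trace). Returns -1 if the lock is not initialized yet.
 */
static int demo_start(struct demo_dev *dev)
{
	assert(dev != NULL);

	if (!dev->lock_ready)
		return -1;	/* lock used before it was initialized */

	pthread_mutex_lock(&dev->profiler_lock);
	/* ... work done under the lock would go here ... */
	pthread_mutex_unlock(&dev->profiler_lock);
	return 0;
}

static void demo_init_lock(struct demo_dev *dev)
{
	dev->profiler_process = NULL;
	pthread_mutex_init(&dev->profiler_lock, NULL);
	dev->lock_ready = true;
}

/*
 * Stand-in for the device init function. With init_lock_first == true the
 * lock is set up at the top, as the patch does; with false it is set up
 * only after the start path has already run, as before the patch.
 */
static int demo_device_init(struct demo_dev *dev, bool init_lock_first)
{
	int ret;

	assert(dev != NULL);
	dev->lock_ready = false;
	dev->init_complete = false;

	if (init_lock_first)
		demo_init_lock(dev);

	ret = demo_start(dev);

	if (!init_lock_first)
		demo_init_lock(dev);	/* too late for demo_start() */

	dev->init_complete = (ret == 0);
	return ret;
}
```

Calling demo_device_init(&dev, true) mirrors the patched ordering and lets
the start path take the lock; passing false reproduces the pre-patch
ordering, where the start path sees an uninitialized lock.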
