On Thu, May 21 2026 at 11:20, Bert Karwatzki wrote:
> On Thursday, 21.05.2026 at 11:09 +0200, Mateusz Guzik wrote:
>
> with next-20260519 (no RT, no LOCKDEP) and got no crash so far (4 boots only
> though (next-20260619 crashed in 2 out of 3 boots without RT)) but I get this
> warning on every boot:
>
> [    2.793416] [    T331] ------------[ cut here ]------------
> [    2.793433] [    T331] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> [    2.793434] [    T331] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x586/0x10c0, CPU#17: (udev-worker)/331

So either the mutex is corrupted or was never initialized.

> [    2.793463] [    T331] Modules linked in: amdgpu(+) hid_generic usbhid 
> drm_client_lib i2c_algo_bit drm_buddy hid drm_ttm_helper ttm drm_exec
> drm_suballoc_helper mfd_core drm_panel_backlight_quirks gpu_sched amdxcp 
> drm_display_helper drm_kms_helper ahci libahci xhci_pci libata xhci_hcd drm 
> nvme
> scsi_mod igc usbcore nvme_core scsi_common video nvme_keyring i2c_piix4 cec 
> nvme_auth usb_common crc16 i2c_smbus wmi gpio_amdpt gpio_generic
> [    2.793518] [    T331] CPU: 17 UID: 0 PID: 331 Comm: (udev-worker) Not 
> tainted 7.1.0-rc4-next-20260519-rcunortlockdep-dirty #465 PREEMPT 
> [    2.793534] [    T331] Hardware name: ASUS System Product Name/ROG STRIX 
> B850-F GAMING WIFI, BIOS 1627 02/05/2026
> [    2.793547] [    T331] RIP: 0010:__mutex_lock+0x58d/0x10c0
> [    2.793555] [    T331] Code: 4c 8b 4d 88 85 c0 0f 84 f8 fa ff ff 44 8b 15 
> ca 9b 81 00 45 85 d2 0f 85 e8 fa ff ff 48 8d 3d 1a 57 82 00 48 c7 c6 a6 51 9e 
> 83
> <67> 48 0f b9 3a 4c 8b 4d 88 e9 cc fa ff ff 48 8b bd 78 ff ff ff e8
> [    2.793579] [    T331] RSP: 0018:ffffa497016c3510 EFLAGS: 00010246
> [    2.793588] [    T331] RAX: 0000000000000001 RBX: ffff88c33a4c2ad8 RCX: 
> 0000000000000000
> [    2.793598] [    T331] RDX: 0000000000000001 RSI: ffffffff839e51a6 RDI: 
> ffffffff83de3c00
> [    2.793609] [    T331] RBP: ffffa497016c35c0 R08: ffffffffc0a55d92 R09: 
> 0000000000000000
> [    2.793619] [    T331] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 0000000000000000
> [    2.793629] [    T331] R13: 0000000000000002 R14: ffffa497016c3550 R15: 
> 0000000000268000
> [    2.793641] [    T331] FS:  00007f1f32e5b9c0(0000) 
> GS:ffff88d23b2ca000(0000) knlGS:0000000000000000
> [    2.793653] [    T331] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.793662] [    T331] CR2: 000055cdfa28f588 CR3: 0000000112e73000 CR4: 
> 0000000000f50ef0
> [    2.793673] [    T331] PKRU: 55555554
> [    2.793678] [    T331] Call Trace:
> [    2.793683] [    T331]  <TASK>
> [    2.793687] [    T331]  ? lock_acquire+0xbe/0x2d0
> [    2.793696] [    T331]  ? init_mqd+0x122/0x190 [amdgpu]
> [    2.793809] [    T331]  ? lock_release+0xc6/0x2a0
> [    2.793816] [    T331]  ? init_mqd+0x122/0x190 [amdgpu]
> [    2.793902] [    T331]  init_mqd+0x122/0x190 [amdgpu]
> [    2.793961] [    T331]  init_mqd_hiq+0xd/0x20 [amdgpu]
> [    2.794015] [    T331]  kq_initialize.constprop.0+0x2b8/0x370 [amdgpu]
> [    2.794071] [    T331]  kernel_queue_init+0x3f/0x60 [amdgpu]
> [    2.794125] [    T331]  pm_init+0x6b/0x100 [amdgpu]
> [    2.794178] [    T331]  start_cpsch+0x1d6/0x270 [amdgpu]
> [    2.794234] [    T331]  kgd2kfd_device_init.cold+0x7b9/0xa1a [amdgpu]
> [    2.794365] [    T331]  amdgpu_amdkfd_device_init+0x190/0x260 [amdgpu]

amdgpu_amdkfd_device_init()
  kgd2kfd_device_init() {
      ....
        init_mqd()
          mutex_lock(... profiler_lock); <- FAIL

      mutex_init(...profiler_lock);
  }

Seems the famous graphics CI failed to catch this...

Thanks,

        tglx
---
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -744,6 +744,9 @@ bool kgd2kfd_device_init(struct kfd_dev
                        KGD_ENGINE_SDMA1);
        kfd->shared_resources = *gpu_resources;
 
+       kfd->profiler_process = NULL;
+       mutex_init(&kfd->profiler_lock);
+
        kfd->num_nodes = amdgpu_xcp_get_num_xcp(kfd->adev->xcp_mgr);
 
        if (kfd->num_nodes == 0) {
@@ -936,9 +939,6 @@ bool kgd2kfd_device_init(struct kfd_dev
 
        svm_range_set_max_pages(kfd->adev);
 
-       kfd->profiler_process = NULL;
-       mutex_init(&kfd->profiler_lock);
-
        kfd->init_complete = true;
        dev_info(kfd_device, "added device %x:%x\n", kfd->adev->pdev->vendor,
                 kfd->adev->pdev->device);
