On 23.06.25 17:23, vitaly prosyak wrote: > > On 2025-06-23 11:16, Christian König wrote: >> On 20.06.25 15:49, Alex Deucher wrote: >>> On Wed, Jun 18, 2025 at 11:58 PM <vitaly.pros...@amd.com> wrote: >>>> From: Vitaly Prosyak <vitaly.pros...@amd.com> >>>> >>>> The issue was reproduced on NV10 using IGT pci_unplug test. >>>> It is expected that `amdgpu_driver_postclose_kms()` is called prior to >>>> `amdgpu_drm_release()`. >>>> However, the bug is that `amdgpu_fpriv` was freed in >>>> `amdgpu_driver_postclose_kms()`, and then >>>> later accessed in `amdgpu_drm_release()` via a call to >>>> `amdgpu_userq_mgr_fini()`. >>>> As a result, KASAN detected a use-after-free condition, as shown in the >>>> log below. >>>> The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` >>>> and >>>> `amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they >>>> are invoked before >>>> `amdgpu_fpriv` is freed. >>>> >>>> This also ensures symmetry with the initialization path in >>>> `amdgpu_driver_open_kms()`, >>>> where the following components are initialized: >>>> - `amdgpu_userq_mgr_init()` >>>> - `amdgpu_eviction_fence_init()` >>>> - `amdgpu_ctx_mgr_init()` >>>> >>>> Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up >>>> using: >>>> - `amdgpu_userq_mgr_fini()` >>>> - `amdgpu_eviction_fence_destroy()` >>>> - `amdgpu_ctx_mgr_fini()` >>>> >>>> This change eliminates the use-after-free and improves consistency in >>>> resource management between open and close paths. >>>> >>>> [ +0.094367] >>>> ================================================================== >>>> [ +0.000026] BUG: KASAN: slab-use-after-free in >>>> amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] >>>> [ +0.000866] Write of size 8 at addr ffff88811c068c60 by task >>>> amd_pci_unplug/1737 >>>> [ +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted >>>> 6.14.0+ #2 >>>> [ +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F >>>> GAMING (WI-FI), BIOS 1401 12/03/2020 >>>> [ +0.000004] Call Trace: >>>> [ +0.000004] <TASK> >>>> [ +0.000003] dump_stack_lvl+0x76/0xa0 >>>> [ +0.000010] print_report+0xce/0x600 >>>> [ +0.000009] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] >>>> [ +0.000790] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000007] ? kasan_complete_mode_report_info+0x76/0x200 >>>> [ +0.000008] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] >>>> [ +0.000684] kasan_report+0xbe/0x110 >>>> [ +0.000007] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] >>>> [ +0.000601] __asan_report_store8_noabort+0x17/0x30 >>>> [ +0.000007] amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] >>>> [ +0.000801] ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu] >>>> [ +0.000819] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000008] amdgpu_drm_release+0xa3/0xe0 [amdgpu] >>>> [ +0.000604] __fput+0x354/0xa90 >>>> [ +0.000010] __fput_sync+0x59/0x80 >>>> [ +0.000005] __x64_sys_close+0x7d/0xe0 >>>> [ +0.000006] x64_sys_call+0x2505/0x26f0 >>>> [ +0.000006] do_syscall_64+0x7c/0x170 >>>> [ +0.000004] ? kasan_record_aux_stack+0xae/0xd0 >>>> [ +0.000005] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? kmem_cache_free+0x398/0x580 >>>> [ +0.000006] ? __fput+0x543/0xa90 >>>> [ +0.000006] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? __fput+0x543/0xa90 >>>> [ +0.000004] ? __kasan_check_read+0x11/0x20 >>>> [ +0.000007] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? __kasan_check_read+0x11/0x20 >>>> [ +0.000003] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? fpregs_assert_state_consistent+0x21/0xb0 >>>> [ +0.000006] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? syscall_exit_to_user_mode+0x4e/0x240 >>>> [ +0.000005] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? do_syscall_64+0x88/0x170 >>>> [ +0.000003] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? do_syscall_64+0x88/0x170 >>>> [ +0.000004] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? irqentry_exit+0x43/0x50 >>>> [ +0.000004] ? srso_return_thunk+0x5/0x5f >>>> [ +0.000004] ? exc_page_fault+0x7c/0x110 >>>> [ +0.000006] entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>> [ +0.000005] RIP: 0033:0x7ffff7b14f67 >>>> [ +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f >>>> 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 >>>> <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff >>>> [ +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: >>>> 0000000000000003 >>>> [ +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: >>>> 00007ffff7b14f67 >>>> [ +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: >>>> 0000000000000003 >>>> [ +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: >>>> 0000000000000000 >>>> [ +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: >>>> 00007fffffffe5c8 >>>> [ +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: >>>> 00007ffff7ffd040 >>>> [ +0.000007] </TASK> >>>> >>>> [ +0.000286] Allocated by task 425 on cpu 11 at 29.751192s: >>>> [ +0.000013] kasan_save_stack+0x28/0x60 >>>> [ +0.000008] kasan_save_track+0x18/0x70 >>>> [ +0.000006] kasan_save_alloc_info+0x38/0x60 >>>> [ +0.000006] __kasan_kmalloc+0xc1/0xd0 >>>> [ +0.000005] __kmalloc_cache_noprof+0x1bd/0x430 >>>> [ +0.000006] amdgpu_driver_open_kms+0x172/0x760 [amdgpu] >>>> [ +0.000521] drm_file_alloc+0x569/0x9a0 >>>> [ +0.000008] drm_client_init+0x1b7/0x410 >>>> [ +0.000007] drm_fbdev_client_setup+0x174/0x470 >>>> [ +0.000007] drm_client_setup+0x8a/0xf0 >>>> [ +0.000006] amdgpu_pci_probe+0x50b/0x10d0 [amdgpu] >>>> [ +0.000482] local_pci_probe+0xe7/0x1b0 >>>> [ +0.000008] pci_device_probe+0x5bf/0x890 >>>> [ +0.000005] really_probe+0x1fd/0x950 >>>> [ +0.000007] __driver_probe_device+0x307/0x410 >>>> [ +0.000005] driver_probe_device+0x4e/0x150 >>>> [ +0.000006] __driver_attach+0x223/0x510 >>>> [ +0.000005] bus_for_each_dev+0x102/0x1a0 >>>> [ +0.000006] driver_attach+0x3d/0x60 >>>> [ +0.000005] bus_add_driver+0x309/0x650 >>>> [ +0.000005] driver_register+0x13d/0x490 >>>> [ +0.000006] __pci_register_driver+0x1ee/0x2b0 >>>> [ +0.000006] xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo] >>>> [ +0.000008] do_one_initcall+0x9c/0x3e0 >>>> [ +0.000007] do_init_module+0x29e/0x7f0 >>>> [ +0.000006] load_module+0x5c75/0x7c80 >>>> [ +0.000006] init_module_from_file+0x106/0x180 >>>> [ +0.000007] idempotent_init_module+0x377/0x740 >>>> [ +0.000006] __x64_sys_finit_module+0xd7/0x180 >>>> [ +0.000006] x64_sys_call+0x1f0b/0x26f0 >>>> [ +0.000006] do_syscall_64+0x7c/0x170 >>>> [ +0.000005] entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>> >>>> [ +0.000013] Freed by task 1737 on cpu 9 at 76.455063s: >>>> [ +0.000010] kasan_save_stack+0x28/0x60 >>>> [ +0.000006] kasan_save_track+0x18/0x70 >>>> [ +0.000005] kasan_save_free_info+0x3b/0x60 >>>> [ +0.000006] __kasan_slab_free+0x54/0x80 >>>> [ +0.000005] kfree+0x127/0x470 >>>> [ +0.000006] amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu] >>>> [ +0.000485] drm_file_free.part.0+0x5b1/0xba0 >>>> [ +0.000007] drm_file_free+0x13/0x30 >>>> [ +0.000006] drm_client_release+0x1c4/0x2b0 >>>> [ +0.000006] drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper] >>>> [ +0.000007] put_fb_info+0x97/0xe0 >>>> [ +0.000006] unregister_framebuffer+0x197/0x380 >>>> [ +0.000005] drm_fb_helper_unregister_info+0x94/0x100 >>>> [ +0.000005] drm_fbdev_client_unregister+0x3c/0x80 >>>> [ +0.000007] drm_client_dev_unregister+0x144/0x330 >>>> [ +0.000006] drm_dev_unregister+0x49/0x1b0 >>>> [ +0.000006] drm_dev_unplug+0x4c/0xd0 >>>> [ +0.000006] amdgpu_pci_remove+0x58/0x130 [amdgpu] >>>> [ +0.000482] pci_device_remove+0xae/0x1e0 >>>> [ +0.000006] device_remove+0xc7/0x180 >>>> [ +0.000006] device_release_driver_internal+0x3d4/0x5a0 >>>> [ +0.000007] device_release_driver+0x12/0x20 >>>> [ +0.000006] pci_stop_bus_device+0x104/0x150 >>>> [ +0.000006] pci_stop_and_remove_bus_device_locked+0x1b/0x40 >>>> [ +0.000005] remove_store+0xd7/0xf0 >>>> [ +0.000007] dev_attr_store+0x3f/0x80 >>>> [ +0.000006] sysfs_kf_write+0x125/0x1d0 >>>> [ +0.000005] kernfs_fop_write_iter+0x2ea/0x490 >>>> [ +0.000007] vfs_write+0x90d/0xe70 >>>> [ +0.000006] ksys_write+0x119/0x220 >>>> [ +0.000006] __x64_sys_write+0x72/0xc0 >>>> [ +0.000006] x64_sys_call+0x18ab/0x26f0 >>>> [ +0.000005] do_syscall_64+0x7c/0x170 >>>> [ +0.000005] entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>> >>>> [ +0.000013] The buggy address belongs to the object at ffff88811c068000 >>>> which belongs to the cache kmalloc-rnd-01-4k of size 4096 >>>> [ +0.000016] The buggy address is located 3168 bytes inside of >>>> freed 4096-byte region [ffff88811c068000, ffff88811c069000) >>>> >>>> [ +0.000022] The buggy address belongs to the physical page: >>>> [ +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 >>>> index:0xffff88811c06e000 pfn:0x11c068 >>>> [ +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 >>>> pincount:0 >>>> [ +0.000006] flags: >>>> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >>>> [ +0.000007] page_type: f5(slab) >>>> [ +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 >>>> 0000000000000000 >>>> [ +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 >>>> 0000000000000000 >>>> [ +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 >>>> 0000000000000000 >>>> [ +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 >>>> 0000000000000000 >>>> [ +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff >>>> 0000000000000000 >>>> [ +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff >>>> 0000000000000000 >>>> [ +0.000004] page dumped because: kasan: bad access detected >>>> >>>> [ +0.000011] Memory state around the buggy address: >>>> [ +0.000009] ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>>> fb fb >>>> [ +0.000012] ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>>> fb fb >>>> [ +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>>> fb fb >>>> [ +0.000011] ^ >>>> [ +0.000010] ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>>> fb fb >>>> [ +0.000011] ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>>> fb fb >>>> [ +0.000011] >>>> ================================================================== >>>> >>>> Cc: Alex Deucher <alexander.deuc...@amd.com> >>>> Cc: Christian König <christian.koe...@amd.com> >>>> Cc: Lijo Lazar <lijo.la...@amd.com> >>>> Cc: Jesse Zhang <jesse.zh...@amd.com> >>>> Cc: Arvind Yadav <arvind.ya...@amd.com> >>> Please add: >>> Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream >>> Error") >>> >>> Reviewed-by: Alex Deucher <alexander.deuc...@amd.com> >> Sorry, I was on vacation last week. >> >> Clear NAK to that here, this will massively break things! > > Could you clarify what issues are introduced by this patch? I validated it > using NV31 user queues and didn’t observe any problems
Without this code when an application crashes drm_release() will first free up all memory and then disable the queues. It is mandatory that we do it the other way around. Regards, Christian. > > Thanks, Vitaly > >> Regards, >> Christian. >> >> >>>> v2: drop amdgpu_drm_release() and assign drm_release() >>>> as the callback directly.(Alex) >>>> >>>> Signed-off-by: Vitaly Prosyak <vitaly.pros...@amd.com> >>>> Change-Id: I7ab41fd785dcd6f2651c726337ca07519ceae574 >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 +--------------- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++ >>>> 2 files changed, 4 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> index 7fd233f160bf..818a49e69aea 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> @@ -2912,20 +2912,6 @@ static int amdgpu_pmops_runtime_idle(struct device >>>> *dev) >>>> return ret; >>>> } >>>> >>>> -static int amdgpu_drm_release(struct inode *inode, struct file *filp) >>>> -{ >>>> - struct drm_file *file_priv = filp->private_data; >>>> - struct amdgpu_fpriv *fpriv = file_priv->driver_priv; >>>> - >>>> - if (fpriv) { >>>> - fpriv->evf_mgr.fd_closing = true; >>>> - amdgpu_eviction_fence_destroy(&fpriv->evf_mgr); >>>> - amdgpu_userq_mgr_fini(&fpriv->userq_mgr); >>>> - } >>>> - >>>> - return drm_release(inode, filp); >>>> -} >>>> - >>>> long amdgpu_drm_ioctl(struct file *filp, >>>> unsigned int cmd, unsigned long arg) >>>> { >>>> @@ -2977,7 +2963,7 @@ static const struct file_operations >>>> amdgpu_driver_kms_fops = { >>>> .owner = THIS_MODULE, >>>> .open = drm_open, >>>> .flush = amdgpu_flush, >>>> - .release = amdgpu_drm_release, >>>> + .release = drm_release, >>>> .unlocked_ioctl = amdgpu_drm_ioctl, >>>> .mmap = drm_gem_mmap, >>>> .poll = drm_poll, >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> index d2ce7d86dbc8..195ed81d39ff 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> @@ -1501,6 +1501,9 @@ void amdgpu_driver_postclose_kms(struct drm_device >>>> *dev, >>>> amdgpu_vm_bo_del(adev, fpriv->prt_va); >>>> amdgpu_bo_unreserve(pd); >>>> } >>>> + fpriv->evf_mgr.fd_closing = true; >>>> + amdgpu_eviction_fence_destroy(&fpriv->evf_mgr); >>>> + amdgpu_userq_mgr_fini(&fpriv->userq_mgr); >>>> >>>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr); >>>> amdgpu_vm_fini(adev, &fpriv->vm); >>>> -- >>>> 2.34.1 >>>>