On 02.07.25 18:40, Philip Yang wrote:
>
> On 2025-07-01 03:28, Christian König wrote:
>> Clear NAK to removing this!
>>
>> The amdgpu_flush function is vital for correct operation.
> There is no fflush call from libdrm/amdgpu, so amdgpu_flush is only called
> from fclose -> filp_flush.
>> The intention is to block closing the file handle in child processes and
>> wait for all previous operations to complete.
>
> A child process cannot share the amdgpu vm with its parent process; the
> child process should open the drm file node to create and use a new amdgpu
> vm. Can you elaborate on the intention: why should a child process closing
> the inherited drm file handle wait for the parent process's operations to
> complete?
No, that goes too deep and I really don't have time for that.

Regards,
Christian.

>
> Regards,
>
> Philip
>
>>
>> Regards,
>> Christian.
>>
>> On 01.07.25 07:35, YuanShang Mao (River) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> @Yang, Philip
>>>> I notice KFD has another different issue with fclose -> amdgpu_flush:
>>>> fork evicts the parent process queues when the child process closes the
>>>> inherited drm node file handle, because amdgpu_flush will signal the
>>>> parent process KFD eviction fence added to the vm root bo resv. This
>>>> causes a performance drop if a python application uses lots of popen.
>>> Yes. Closing the inherited drm node file handle will evict the parent
>>> process queues, since drm shares the vm with kfd.
>>>
>>>> function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush, so
>>>> it can be removed too.
>>> Sure, if we decide to remove amdgpu_flush.
>>>
>>> @Koenig, Christian @Deucher, Alexander, do you have any concerns about
>>> the removal of amdgpu_flush?
>>>
>>> Thanks
>>> River
>>>
>>>
>>> -----Original Message-----
>>> From: Yang, Philip <philip.y...@amd.com>
>>> Sent: Friday, June 27, 2025 10:44 PM
>>> To: YuanShang Mao (River) <yuanshang....@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; cao, lin <lin....@amd.com>; Deng, Emily <emily.d...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: delete function amdgpu_flush
>>>
>>>
>>> On 2025-06-27 01:20, YuanShang Mao (River) wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Currently, amdgpu_flush is used to prevent new jobs from being submitted
>>>> in the same context when a file descriptor is closed and to wait for
>>>> existing jobs to complete. Additionally, if the current process is in an
>>>> exit state and the latest job of the entity was submitted by this
>>>> process, the entity is terminated.
>>>>
>>>> There is an issue where, if the drm scheduler is not woken up for some
>>>> reason, amdgpu_flush will remain hung, and another process holding this
>>>> file cannot submit a job to wake up the drm scheduler.
>>> I notice KFD has another different issue with fclose -> amdgpu_flush:
>>> fork evicts the parent process queues when the child process closes the
>>> inherited drm node file handle, because amdgpu_flush will signal the
>>> parent process KFD eviction fence added to the vm root bo resv. This
>>> causes a performance drop if a python application uses lots of popen.
>>>
>>> [677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
>>> [677852.634814] __dma_fence_enable_signaling+0x3e/0xe0
>>> [677852.634820] dma_fence_wait_timeout+0x3a/0x140
>>> [677852.634825] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
>>> [677852.634831] amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
>>> [677852.635026] amdgpu_flush+0x34/0x50 [amdgpu]
>>> [677852.635208] filp_flush+0x38/0x90
>>> [677852.635213] filp_close+0x14/0x30
>>> [677852.635216] do_close_on_exec+0xdd/0x130
>>> [677852.635221] begin_new_exec+0x1da/0x490
>>> [677852.635225] load_elf_binary+0x307/0xea0
>>> [677852.635231] ? srso_alias_return_thunk+0x5/0xfbef5
>>> [677852.635235] ? ima_bprm_check+0xa2/0xd0
>>> [677852.635240] search_binary_handler+0xda/0x260
>>> [677852.635245] exec_binprm+0x58/0x1a0
>>> [677852.635249] bprm_execve.part.0+0x16f/0x210
>>> [677852.635254] bprm_execve+0x45/0x80
>>> [677852.635257] do_execveat_common.isra.0+0x190/0x200
>>>
>>>> The intended purpose of the flush operation in linux is to flush the
>>>> content written by the current process to the hardware, rather than to
>>>> shut down related services upon the process's exit, which would prevent
>>>> other processes from using them. Now, amdgpu_flush cannot execute
>>>> concurrently with the command submission ioctl, which also leads to
>>>> performance degradation.
>>> fclose -> filp_flush -> fput: if fput releases the last reference of the
>>> drm node file handle, it calls amdgpu_driver_postclose_kms ->
>>> amdgpu_ctx_mgr_fini, which will flush the entities, so amdgpu_flush is
>>> not needed.
>>>
>>> I thought about adding a workaround to skip amdgpu_flush if
>>> (vm->task_info->tgid != current->group_leader->pid) for KFD, but this
>>> patch will fix both gfx and KFD, one stone for two birds.
>>>
>>> function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush, so
>>> it can be removed too.
>>>
>>> Regards,
>>>
>>> Philip
>>>
>>>> An example of a shared DRM file is when systemd stops the display
>>>> manager; systemd will close the file descriptor of Xorg that it holds.
>>>>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: amdgpu_ctx_get: locked by other task times 8811
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: owner stack:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: task:(sd-rmrf) state:D stack:0 pid:3407 tgid:3407 ppid:1 flags:0x00004002
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: <TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: __schedule+0x279/0x6b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: schedule+0x29/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amddrm_sched_entity_flush+0x13e/0x270 [amd_sched]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_ctx_mgr_entity_flush+0xd6/0x200 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_flush+0x29/0x50 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: filp_flush+0x38/0x90
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: filp_close+0x14/0x30
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: __close_range+0x1b0/0x230
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: __x64_sys_close_range+0x17/0x30
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: x64_sys_call+0x1e0f/0x25f0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: do_syscall_64+0x7e/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __count_memcg_events+0x86/0x160
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? count_memcg_events.constprop.0+0x2a/0x50
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? handle_mm_fault+0x1df/0x2d0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_user_addr_fault+0x5d5/0x870
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? irqentry_exit_to_user_mode+0x43/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? irqentry_exit+0x43/0x50
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? exc_page_fault+0x96/0x1c0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x762b6df1677b
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffdb20ad718 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000762b6df1677b
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 0000000000000000 RSI: 000000007fffffff RDI: 0000000000000003
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffdb20ad730 R08: 0000000000000000 R09: 0000000000000000
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: </TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: current stack:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: task:Xorg state:R running task stack:0 pid:2343 tgid:2343 ppid:2341 flags:0x00000008
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: <TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: sched_show_task+0x122/0x180
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_ctx_get+0xf6/0x120 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_cs_ioctl+0xb6/0x2110 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? update_cfs_group+0x111/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? enqueue_entity+0x3a6/0x550
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: drm_ioctl_kernel+0xbc/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: drm_ioctl+0x2f6/0x5b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: __x64_sys_ioctl+0xa3/0xf0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: x64_sys_call+0x11ad/0x25f0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: do_syscall_64+0x7e/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? ksys_read+0xe6/0x100
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? idr_find+0xf/0x20
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_syncobj_array_free+0x5a/0x80
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_syncobj_reset_ioctl+0xbd/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_drm_syncobj_reset_ioctl+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_ioctl_kernel+0xbc/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __check_object_size.part.0+0x3a/0x150
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? _copy_to_user+0x41/0x60
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_ioctl+0x326/0x5b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_drm_syncobj_reset_ioctl+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? kvm_clock_get_cycles+0x18/0x40
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pm_runtime_suspend+0x7b/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? amdgpu_drm_ioctl+0x70/0x90 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __x64_sys_ioctl+0xbb/0xf0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? syscall_exit_to_user_mode+0x4e/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? syscall_exit_to_user_mode+0x4e/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? sysvec_apic_timer_interrupt+0x57/0xc0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x7156c3524ded
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffe4afcc410 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 0000578954b74cf8 RCX: 00007156c3524ded
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 00007ffe4afcc4f0 RSI: 00000000c0186444 RDI: 0000000000000012
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffe4afcc460 R08: 00007ffe4afcc7a0 R09: 00007ffe4afcc4b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000578954b862f0 R11: 0000000000000246 R12: 00000000c0186444
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000012 R14: 0000000000000060 R15: 0000578954b46380
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: </TASK>
>>>>
>>>> Signed-off-by: YuanShang <yuanshang....@amd.com>
>>>>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 -------------
>>>>  1 file changed, 13 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 2bb02fe9c880..ee6b59bfd798 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -2947,22 +2947,9 @@ static const struct dev_pm_ops amdgpu_pm_ops = {
>>>>  	.runtime_idle = amdgpu_pmops_runtime_idle,
>>>>  };
>>>>
>>>> -static int amdgpu_flush(struct file *f, fl_owner_t id)
>>>> -{
>>>> -	struct drm_file *file_priv = f->private_data;
>>>> -	struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
>>>> -	long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;
>>>> -
>>>> -	timeout = amdgpu_ctx_mgr_entity_flush(&fpriv->ctx_mgr, timeout);
>>>> -	timeout = amdgpu_vm_wait_idle(&fpriv->vm, timeout);
>>>> -
>>>> -	return timeout >= 0 ? 0 : timeout;
>>>> -}
>>>> -
>>>>  static const struct file_operations amdgpu_driver_kms_fops = {
>>>>  	.owner = THIS_MODULE,
>>>>  	.open = drm_open,
>>>> -	.flush = amdgpu_flush,
>>>>  	.release = drm_release,
>>>>  	.unlocked_ioctl = amdgpu_drm_ioctl,
>>>>  	.mmap = drm_gem_mmap,
>>>> --
>>>> 2.25.1
>>>>
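
For reference, below is a minimal sketch of the alternative Philip floats
above: keep amdgpu_flush but skip the wait when an inherited file handle is
closed by a process other than the VM's creator. It is based on the
amdgpu_flush body removed in the diff; the vm.task_info/tgid access follows
the expression in the discussion and is illustrative only (it would need the
usual NULL/lifetime handling), and it is not what the posted patch does. The
patch instead removes amdgpu_flush entirely and, per Philip's explanation,
relies on the last fput calling amdgpu_driver_postclose_kms ->
amdgpu_ctx_mgr_fini to flush the entities.

/*
 * Sketch only: return early when the file is being closed by a different
 * thread group than the one that created the VM (e.g. a forked child or
 * close-on-exec), so the parent's queues are not evicted and its jobs are
 * not waited on. Field names are illustrative, taken from the discussion.
 */
static int amdgpu_flush(struct file *f, fl_owner_t id)
{
	struct drm_file *file_priv = f->private_data;
	struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
	long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;

	/* Inherited handle closed by another process: nothing of ours to flush. */
	if (fpriv->vm.task_info &&
	    fpriv->vm.task_info->tgid != current->group_leader->pid)
		return 0;

	timeout = amdgpu_ctx_mgr_entity_flush(&fpriv->ctx_mgr, timeout);
	timeout = amdgpu_vm_wait_idle(&fpriv->vm, timeout);

	return timeout >= 0 ? 0 : timeout;
}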