On 02.07.25 18:40, Philip Yang wrote:
> 
> On 2025-07-01 03:28, Christian König wrote:
>> Clear NAK to removing this!
>>
>> The amdgpu_flush function is vital for correct operation.
> There is no fflush call from libdrm/amdgpu, so amdgpu_flush is only called 
> from fclose -> filp_flush.
>> The intention is to block closing the file handle in child processes and 
>> wait for all previous operations to complete.
> 
> A child process cannot share an amdgpu VM with its parent process; the child 
> process should open the drm file node to create and use a new amdgpu VM. Can 
> you elaborate on the intention: why should a child process closing the 
> inherited drm file handle wait for the parent process's operations to 
> complete?

No, that goes too deep and I really don't have time for that.

Regards,
Christian.

> 
> Regards,
> 
> Philip
> 
>>
>> Regards,
>> Christian.
>>
>> On 01.07.25 07:35, YuanShang Mao (River) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> @Yang, Philip
>>>> I notice KFD has another, different issue with fclose -> amdgpu_flush:
>>>> fork evicts the parent process's queues when the child process closes
>>>> the inherited drm node file handle. amdgpu_flush signals the parent
>>>> process's KFD eviction fence added to the vm root bo resv, which causes
>>>> a performance drop if a python application uses lots of popen.
>>> Yes. Closing the inherited drm node file handle will evict the parent 
>>> process's queues, since drm shares the vm with kfd.
>>>
>>>> The function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush,
>>>> so it can be removed too.
>>> Sure. If we decide to remove amdgpu_flush.
>>>
>>> @Koenig, Christian @Deucher, Alexander, do you have any concerns about 
>>> the removal of amdgpu_flush?
>>>
>>> Thanks
>>> River
>>>
>>>
>>> -----Original Message-----
>>> From: Yang, Philip <philip.y...@amd.com>
>>> Sent: Friday, June 27, 2025 10:44 PM
>>> To: YuanShang Mao (River) <yuanshang....@amd.com>; 
>>> amd-gfx@lists.freedesktop.org
>>> Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; cao, lin <lin....@amd.com>; 
>>> Deng, Emily <emily.d...@amd.com>; Deucher, Alexander 
>>> <alexander.deuc...@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: delete function amdgpu_flush
>>>
>>>
>>> On 2025-06-27 01:20, YuanShang Mao (River) wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Currently, amdgpu_flush is used to prevent new jobs from being submitted 
>>>> in the same context when a file descriptor is closed, and to wait for 
>>>> existing jobs to complete. Additionally, if the current process is in an 
>>>> exit state and the latest job of the entity was submitted by this process, 
>>>> the entity is terminated.
>>>>
>>>> There is an issue where, if the drm scheduler is not woken up for some 
>>>> reason, amdgpu_flush will hang, and another process holding this file 
>>>> cannot submit a job to wake up the drm scheduler.
>>> I notice KFD has another, different issue with fclose -> amdgpu_flush:
>>> fork evicts the parent process's queues when the child process closes the
>>> inherited drm node file handle. amdgpu_flush signals the parent process's
>>> KFD eviction fence added to the vm root bo resv, which causes a
>>> performance drop if a python application uses lots of popen.
>>>
>>> [677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
>>> [677852.634814]  __dma_fence_enable_signaling+0x3e/0xe0
>>> [677852.634820]  dma_fence_wait_timeout+0x3a/0x140
>>> [677852.634825]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
>>> [677852.634831]  amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
>>> [677852.635026]  amdgpu_flush+0x34/0x50 [amdgpu]
>>> [677852.635208]  filp_flush+0x38/0x90
>>> [677852.635213]  filp_close+0x14/0x30
>>> [677852.635216]  do_close_on_exec+0xdd/0x130
>>> [677852.635221]  begin_new_exec+0x1da/0x490
>>> [677852.635225]  load_elf_binary+0x307/0xea0
>>> [677852.635231]  ? srso_alias_return_thunk+0x5/0xfbef5
>>> [677852.635235]  ? ima_bprm_check+0xa2/0xd0
>>> [677852.635240]  search_binary_handler+0xda/0x260
>>> [677852.635245]  exec_binprm+0x58/0x1a0
>>> [677852.635249]  bprm_execve.part.0+0x16f/0x210
>>> [677852.635254]  bprm_execve+0x45/0x80
>>> [677852.635257]  do_execveat_common.isra.0+0x190/0x200
>>>
>>>> The intended purpose of the flush operation in Linux is to flush content 
>>>> written by the current process to the hardware, rather than to shut down 
>>>> related services upon the process's exit, which would prevent other 
>>>> processes from using them. Currently, amdgpu_flush cannot execute 
>>>> concurrently with the command submission ioctl, which also leads to 
>>>> performance degradation.
>>> fclose -> filp_flush -> fput: if fput releases the last reference to the 
>>> drm node file handle, it calls amdgpu_driver_postclose_kms -> 
>>> amdgpu_ctx_mgr_fini, which flushes the entities, so amdgpu_flush is not 
>>> needed.
>>>
>>> I thought about adding a workaround to skip amdgpu_flush for KFD if 
>>> (vm->task_info->tgid != current->group_leader->pid), but this patch fixes 
>>> both gfx and KFD, killing two birds with one stone.
>>>
>>> The function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush, 
>>> so it can be removed too.
>>>
>>> Regards,
>>>
>>> Philip
>>>
>>>> An example of a shared DRM file is when systemd stops the display 
>>>> manager: systemd will close the file descriptor for Xorg that it holds.
>>>>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: amdgpu_ctx_get: locked by 
>>>> other task times 8811
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: owner stack:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: task:(sd-rmrf)       state:D stack:0  
>>>>    pid:3407  tgid:3407  ppid:1      flags:0x00004002
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  <TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  __schedule+0x279/0x6b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  schedule+0x29/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  
>>>> amddrm_sched_entity_flush+0x13e/0x270 [amd_sched]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> __pfx_autoremove_wake_function+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  
>>>> amdgpu_ctx_mgr_entity_flush+0xd6/0x200 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_flush+0x29/0x50 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  filp_flush+0x38/0x90
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  filp_close+0x14/0x30
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  __close_range+0x1b0/0x230
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  __x64_sys_close_range+0x17/0x30
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  x64_sys_call+0x1e0f/0x25f0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  do_syscall_64+0x7e/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __count_memcg_events+0x86/0x160
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> count_memcg_events.constprop.0+0x2a/0x50
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? handle_mm_fault+0x1df/0x2d0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_user_addr_fault+0x5d5/0x870
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> irqentry_exit_to_user_mode+0x43/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? irqentry_exit+0x43/0x50
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? exc_page_fault+0x96/0x1c0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x762b6df1677b
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffdb20ad718 EFLAGS: 
>>>> 00000246 ORIG_RAX: 00000000000001b4
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 
>>>> 0000000000000000 RCX: 0000762b6df1677b
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 0000000000000000 RSI: 
>>>> 000000007fffffff RDI: 0000000000000003
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffdb20ad730 R08: 
>>>> 0000000000000000 R09: 0000000000000000
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000000000000008 R11: 
>>>> 0000000000000246 R12: 0000000000000007
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000000 R14: 
>>>> 0000000000000000 R15: 0000000000000000
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  </TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: current stack:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: task:Xorg            state:R  running 
>>>> task     stack:0     pid:2343  tgid:2343  ppid:2341   flags:0x00000008
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  <TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  sched_show_task+0x122/0x180
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_ctx_get+0xf6/0x120 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_cs_ioctl+0xb6/0x2110 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? update_cfs_group+0x111/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? enqueue_entity+0x3a6/0x550
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 
>>>> [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  drm_ioctl_kernel+0xbc/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  drm_ioctl+0x2f6/0x5b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 
>>>> [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  __x64_sys_ioctl+0xa3/0xf0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  x64_sys_call+0x11ad/0x25f0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  do_syscall_64+0x7e/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? ksys_read+0xe6/0x100
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? idr_find+0xf/0x20
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? drm_syncobj_array_free+0x5a/0x80
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? drm_syncobj_reset_ioctl+0xbd/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> __pfx_drm_syncobj_reset_ioctl+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? drm_ioctl_kernel+0xbc/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> __check_object_size.part.0+0x3a/0x150
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? _copy_to_user+0x41/0x60
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? drm_ioctl+0x326/0x5b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> __pfx_drm_syncobj_reset_ioctl+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? kvm_clock_get_cycles+0x18/0x40
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __pm_runtime_suspend+0x7b/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? amdgpu_drm_ioctl+0x70/0x90 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __x64_sys_ioctl+0xbb/0xf0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> syscall_exit_to_user_mode+0x4e/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> syscall_exit_to_user_mode+0x4e/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
>>>> sysvec_apic_timer_interrupt+0x57/0xc0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x7156c3524ded
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Code: 04 25 28 00 00 00 48 89 45 c8 
>>>> 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 
>>>> b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 
>>>> 25 28 00 00 00
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffe4afcc410 EFLAGS: 
>>>> 00000246 ORIG_RAX: 0000000000000010
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 
>>>> 0000578954b74cf8 RCX: 00007156c3524ded
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 00007ffe4afcc4f0 RSI: 
>>>> 00000000c0186444 RDI: 0000000000000012
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffe4afcc460 R08: 
>>>> 00007ffe4afcc7a0 R09: 00007ffe4afcc4b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000578954b862f0 R11: 
>>>> 0000000000000246 R12: 00000000c0186444
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000012 R14: 
>>>> 0000000000000060 R15: 0000578954b46380
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel:  </TASK>
>>>>
>>>> Signed-off-by: YuanShang <yuanshang....@amd.com>
>>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 -------------
>>>>    1 file changed, 13 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 2bb02fe9c880..ee6b59bfd798 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -2947,22 +2947,9 @@ static const struct dev_pm_ops amdgpu_pm_ops = {
>>>>           .runtime_idle = amdgpu_pmops_runtime_idle,  };
>>>>
>>>> -static int amdgpu_flush(struct file *f, fl_owner_t id) -{
>>>> -       struct drm_file *file_priv = f->private_data;
>>>> -       struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
>>>> -       long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;
>>>> -
>>>> -       timeout = amdgpu_ctx_mgr_entity_flush(&fpriv->ctx_mgr, timeout);
>>>> -       timeout = amdgpu_vm_wait_idle(&fpriv->vm, timeout);
>>>> -
>>>> -       return timeout >= 0 ? 0 : timeout;
>>>> -}
>>>> -
>>>>    static const struct file_operations amdgpu_driver_kms_fops = {
>>>>           .owner = THIS_MODULE,
>>>>           .open = drm_open,
>>>> -       .flush = amdgpu_flush,
>>>>           .release = drm_release,
>>>>           .unlocked_ioctl = amdgpu_drm_ioctl,
>>>>           .mmap = drm_gem_mmap,
>>>> -- 
>>>> 2.25.1
>>>>
