[AMD Official Use Only - AMD Internal Distribution Only]

@Yang, Philip
>I noticed that KFD has a different issue with fclose -> amdgpu_flush:
>after a fork, the parent process's queues are evicted when the child
>process closes the inherited DRM node file handle. amdgpu_flush signals
>the parent process's KFD eviction fence, which was added to the VM root
>BO resv. This causes a performance drop if a Python application makes
>heavy use of popen.

Yes. Closing an inherited DRM node file handle evicts the parent process's
queues, since DRM shares the VM with KFD.

>The function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush,
>so it can be removed too.

Sure, if we decide to remove amdgpu_flush.

@Koenig, Christian @Deucher, Alexander, do you have any concerns about removing
amdgpu_flush?

Thanks
River


-----Original Message-----
From: Yang, Philip <philip.y...@amd.com>
Sent: Friday, June 27, 2025 10:44 PM
To: YuanShang Mao (River) <yuanshang....@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; cao, lin <lin....@amd.com>; 
Deng, Emily <emily.d...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>
Subject: Re: [PATCH] drm/amdgpu: delete function amdgpu_flush


On 2025-06-27 01:20, YuanShang Mao (River) wrote:
> Currently, amdgpu_flush is used to prevent new jobs from being submitted in 
> the same context when a file descriptor is closed and to wait for existing 
> jobs to complete. Additionally, if the current process is in an exit state 
> and the latest job of the entity was submitted by this process, the entity is 
> terminated.
>
> There is an issue where, if the DRM scheduler is not woken up for some reason,
> amdgpu_flush remains hung, and another process holding this file
> cannot submit a job to wake up the DRM scheduler.

I noticed that KFD has a different issue with fclose -> amdgpu_flush:
after a fork, the parent process's queues are evicted when the child
process closes the inherited DRM node file handle. amdgpu_flush signals
the parent process's KFD eviction fence, which was added to the VM root
BO resv. This causes a performance drop if a Python application makes
heavy use of popen.

[677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
[677852.634814]  __dma_fence_enable_signaling+0x3e/0xe0
[677852.634820]  dma_fence_wait_timeout+0x3a/0x140
[677852.634825]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[677852.634831]  amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
[677852.635026]  amdgpu_flush+0x34/0x50 [amdgpu]
[677852.635208]  filp_flush+0x38/0x90
[677852.635213]  filp_close+0x14/0x30
[677852.635216]  do_close_on_exec+0xdd/0x130
[677852.635221]  begin_new_exec+0x1da/0x490
[677852.635225]  load_elf_binary+0x307/0xea0
[677852.635231]  ? srso_alias_return_thunk+0x5/0xfbef5
[677852.635235]  ? ima_bprm_check+0xa2/0xd0
[677852.635240]  search_binary_handler+0xda/0x260
[677852.635245]  exec_binprm+0x58/0x1a0
[677852.635249]  bprm_execve.part.0+0x16f/0x210
[677852.635254]  bprm_execve+0x45/0x80
[677852.635257]  do_execveat_common.isra.0+0x190/0x200

>
> The intended purpose of the flush operation in Linux is to flush the content
> written by the current process to the hardware, rather than shutting down
> related services upon the process's exit, which would prevent other processes
> from using them. Currently, amdgpu_flush cannot execute concurrently with the
> command submission ioctl, which also degrades performance.

The path is fclose -> filp_flush -> fput. If fput releases the last
reference to the DRM node file handle, it calls
amdgpu_driver_postclose_kms -> amdgpu_ctx_mgr_fini, which flushes the
entities, so amdgpu_flush is not needed.
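
The difference between the two hooks in the call chain above can be sketched with a simplified user-space model. This is not kernel code: struct model_file and the model_* functions are hypothetical stand-ins, and the only assumption carried over is that the flush hook runs on every close while the release path runs once, when the last reference is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct file. */
struct model_file {
    int refcount;      /* references held by file descriptor tables */
    int flush_calls;   /* how many times the flush hook ran */
    int release_calls; /* how many times the release hook ran */
};

/* Models fork(): the child's descriptor table takes another reference. */
void model_fork_inherit(struct model_file *f)
{
    assert(f != NULL && f->refcount > 0);
    f->refcount++;
}

/* Models fput(): the release hook runs only when the last reference goes. */
void model_fput(struct model_file *f)
{
    assert(f != NULL && f->refcount > 0);
    if (--f->refcount == 0)
        f->release_calls++;
}

/* Models close(fd): the flush hook runs on every close, then fput(). */
void model_close(struct model_file *f)
{
    assert(f != NULL);
    f->flush_calls++;
    model_fput(f);
}
```

In this model, a child closing its inherited descriptor bumps flush_calls while release_calls stays at zero; only the final close by the parent triggers the release path, which is where the entities would be flushed in the chain described above.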

I had considered adding a workaround for KFD that skips amdgpu_flush if
(vm->task_info->tgid != current->group_leader->pid). This patch fixes
both gfx and KFD, two birds with one stone.
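
For illustration only, the ownership check in that workaround can be modeled in user space as below. struct model_vm and model_should_skip_flush are hypothetical names, and the real condition would read vm->task_info->tgid and current->group_leader->pid inside the driver; with amdgpu_flush removed, no such check is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the VM's record of the process that owns it. */
struct model_vm {
    pid_t owner_tgid;
};

/*
 * Returns true when the closing process is not the owner of the VM,
 * i.e. the case where the workaround would have skipped the flush.
 */
bool model_should_skip_flush(const struct model_vm *vm, pid_t current_tgid)
{
    assert(vm != NULL);
    return vm->owner_tgid != current_tgid;
}
```

A forked child has a different tgid from the parent that created the VM, so the predicate is true for the child's close and false for the parent's own close.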

The function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush,
so it can be removed too.

Regards,

Philip

>
> An example of a shared DRM file is when systemd stops the display manager;
> systemd will close the file descriptor of Xorg that it holds.
>
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: amdgpu_ctx_get: locked by other 
> task times 8811
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: owner stack:
> Jun 11 16:24:24 ubuntu2404-2 kernel: task:(sd-rmrf)       state:D stack:0     
> pid:3407  tgid:3407  ppid:1      flags:0x00004002
> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
> Jun 11 16:24:24 ubuntu2404-2 kernel:  <TASK>
> Jun 11 16:24:24 ubuntu2404-2 kernel:  __schedule+0x279/0x6b0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  schedule+0x29/0xd0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  amddrm_sched_entity_flush+0x13e/0x270 
> [amd_sched]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
> __pfx_autoremove_wake_function+0x10/0x10
> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_ctx_mgr_entity_flush+0xd6/0x200 
> [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_flush+0x29/0x50 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  filp_flush+0x38/0x90
> Jun 11 16:24:24 ubuntu2404-2 kernel:  filp_close+0x14/0x30
> Jun 11 16:24:24 ubuntu2404-2 kernel:  __close_range+0x1b0/0x230
> Jun 11 16:24:24 ubuntu2404-2 kernel:  __x64_sys_close_range+0x17/0x30
> Jun 11 16:24:24 ubuntu2404-2 kernel:  x64_sys_call+0x1e0f/0x25f0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  do_syscall_64+0x7e/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __count_memcg_events+0x86/0x160
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
> count_memcg_events.constprop.0+0x2a/0x50
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? handle_mm_fault+0x1df/0x2d0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_user_addr_fault+0x5d5/0x870
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? irqentry_exit_to_user_mode+0x43/0x250
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? irqentry_exit+0x43/0x50
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? exc_page_fault+0x96/0x1c0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x762b6df1677b
> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffdb20ad718 EFLAGS: 
> 00000246 ORIG_RAX: 00000000000001b4
> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 
> 0000000000000000 RCX: 0000762b6df1677b
> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 0000000000000000 RSI: 
> 000000007fffffff RDI: 0000000000000003
> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffdb20ad730 R08: 
> 0000000000000000 R09: 0000000000000000
> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000000000000008 R11: 
> 0000000000000246 R12: 0000000000000007
> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000000 R14: 
> 0000000000000000 R15: 0000000000000000
> Jun 11 16:24:24 ubuntu2404-2 kernel:  </TASK>
> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: current stack:
> Jun 11 16:24:24 ubuntu2404-2 kernel: task:Xorg            state:R  running 
> task     stack:0     pid:2343  tgid:2343  ppid:2341   flags:0x00000008
> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
> Jun 11 16:24:24 ubuntu2404-2 kernel:  <TASK>
> Jun 11 16:24:24 ubuntu2404-2 kernel:  sched_show_task+0x122/0x180
> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_ctx_get+0xf6/0x120 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_cs_ioctl+0xb6/0x2110 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? update_cfs_group+0x111/0x120
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? enqueue_entity+0x3a6/0x550
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 
> [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  drm_ioctl_kernel+0xbc/0x120
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  drm_ioctl+0x2f6/0x5b0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 
> [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  __x64_sys_ioctl+0xa3/0xf0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  x64_sys_call+0x11ad/0x25f0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  do_syscall_64+0x7e/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? ksys_read+0xe6/0x100
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? idr_find+0xf/0x20
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? drm_syncobj_array_free+0x5a/0x80
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? drm_syncobj_reset_ioctl+0xbd/0xd0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
> __pfx_drm_syncobj_reset_ioctl+0x10/0x10
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? drm_ioctl_kernel+0xbc/0x120
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __check_object_size.part.0+0x3a/0x150
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? _copy_to_user+0x41/0x60
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? drm_ioctl+0x326/0x5b0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? 
> __pfx_drm_syncobj_reset_ioctl+0x10/0x10
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? kvm_clock_get_cycles+0x18/0x40
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __pm_runtime_suspend+0x7b/0xd0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? amdgpu_drm_ioctl+0x70/0x90 [amdgpu]
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? __x64_sys_ioctl+0xbb/0xf0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? syscall_exit_to_user_mode+0x4e/0x250
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? syscall_exit_to_user_mode+0x4e/0x250
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? srso_return_thunk+0x5/0x5f
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? do_syscall_64+0x8a/0x170
> Jun 11 16:24:24 ubuntu2404-2 kernel:  ? sysvec_apic_timer_interrupt+0x57/0xc0
> Jun 11 16:24:24 ubuntu2404-2 kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x7156c3524ded
> Jun 11 16:24:24 ubuntu2404-2 kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 
> c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 
> 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 
> 00 00
> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffe4afcc410 EFLAGS: 
> 00000246 ORIG_RAX: 0000000000000010
> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 
> 0000578954b74cf8 RCX: 00007156c3524ded
> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 00007ffe4afcc4f0 RSI: 
> 00000000c0186444 RDI: 0000000000000012
> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffe4afcc460 R08: 
> 00007ffe4afcc7a0 R09: 00007ffe4afcc4b0
> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000578954b862f0 R11: 
> 0000000000000246 R12: 00000000c0186444
> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000012 R14: 
> 0000000000000060 R15: 0000578954b46380
> Jun 11 16:24:24 ubuntu2404-2 kernel:  </TASK>
>
> Signed-off-by: YuanShang <yuanshang....@amd.com>
>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 -------------
>   1 file changed, 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 2bb02fe9c880..ee6b59bfd798 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2947,22 +2947,9 @@ static const struct dev_pm_ops amdgpu_pm_ops = {
>          .runtime_idle = amdgpu_pmops_runtime_idle,
>   };
>
> -static int amdgpu_flush(struct file *f, fl_owner_t id)
> -{
> -       struct drm_file *file_priv = f->private_data;
> -       struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
> -       long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;
> -
> -       timeout = amdgpu_ctx_mgr_entity_flush(&fpriv->ctx_mgr, timeout);
> -       timeout = amdgpu_vm_wait_idle(&fpriv->vm, timeout);
> -
> -       return timeout >= 0 ? 0 : timeout;
> -}
> -
>   static const struct file_operations amdgpu_driver_kms_fops = {
>          .owner = THIS_MODULE,
>          .open = drm_open,
> -       .flush = amdgpu_flush,
>          .release = drm_release,
>          .unlocked_ioctl = amdgpu_drm_ioctl,
>          .mmap = drm_gem_mmap,
> --
> 2.25.1
>
