On Thu, Apr 10, 2025 at 8:35 PM Christian König <christian.koe...@amd.com> wrote: > > Am 10.04.25 um 11:33 schrieb Qiang Yu: > > On Tue, Apr 8, 2025 at 11:48 PM Erico Nunes <nunes.er...@gmail.com> wrote: > >> With this callback implemented, a terminating application will wait for > >> the sched entity to be flushed out to the hardware and cancel all other > >> pending jobs before destroying its context. > > We do flush when file release in lima_ctx_mgr_fini. Why do we wait here > > in flush? What's the difference? > > Waiting for submissions when you release the file descriptor is actually a > bad idea since that can prevent SIGKILL from acting immediately. For example > the OOM killer absolutely doesn't like that and eventually calls panic(). > > Flush is called either manually, on process termination or when you send a > SIGTERM. This should then wait for any I/O to complete. > > The idea is now that you can then still send a SIGKILL to abort waiting for > the I/O as well and so get pending GPU operations not submitted to the HW. > > The DRM scheduler helps doing that by providing the drm_sched_entity_flush() > and drm_sched_entity_fini() functions. > > When there is still pending work when drm_sched_entity_fini() is called a > callback to kill it is installed and the job just freed instead of executed. > So the difference is we can always wait in flush, but better not in release when SIGKILL?
> > > >> This prevents applications with multiple contexts from running into a > >> race condition between running tasks and context destroy when > >> terminating. > >> If implementing flush callback fix the problem, it must not be when SIGKILL. Could you describe the problem and how this fix solves it? As I don't understand how the above difference could fix a race bug. > >> Signed-off-by: Erico Nunes <nunes.er...@gmail.com> > >> --- > >> drivers/gpu/drm/lima/lima_ctx.c | 18 ++++++++++++++++++ > >> drivers/gpu/drm/lima/lima_ctx.h | 1 + > >> drivers/gpu/drm/lima/lima_drv.c | 17 ++++++++++++++++- > >> 3 files changed, 35 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/gpu/drm/lima/lima_ctx.c > >> b/drivers/gpu/drm/lima/lima_ctx.c > >> index 0e668fc1e0f9..e8fb5788ca69 100644 > >> --- a/drivers/gpu/drm/lima/lima_ctx.c > >> +++ b/drivers/gpu/drm/lima/lima_ctx.c > >> @@ -100,3 +100,21 @@ void lima_ctx_mgr_fini(struct lima_ctx_mgr *mgr) > >> xa_destroy(&mgr->handles); > >> mutex_destroy(&mgr->lock); > >> } > >> + > >> +long lima_ctx_mgr_flush(struct lima_ctx_mgr *mgr, long timeout) > >> +{ > >> + struct lima_ctx *ctx; > >> + unsigned long id; > >> + > >> + mutex_lock(&mgr->lock); > >> + xa_for_each(&mgr->handles, id, ctx) { > >> + for (int i = 0; i < lima_pipe_num; i++) { > >> + struct lima_sched_context *context = > >> &ctx->context[i]; > >> + struct drm_sched_entity *entity = &context->base; > >> + > >> + timeout = drm_sched_entity_flush(entity, timeout); > >> + } > >> + } > >> + mutex_unlock(&mgr->lock); > >> + return timeout; > >> +} > >> diff --git a/drivers/gpu/drm/lima/lima_ctx.h > >> b/drivers/gpu/drm/lima/lima_ctx.h > >> index 5b1063ce968b..ff133db6ae4c 100644 > >> --- a/drivers/gpu/drm/lima/lima_ctx.h > >> +++ b/drivers/gpu/drm/lima/lima_ctx.h > >> @@ -30,5 +30,6 @@ struct lima_ctx *lima_ctx_get(struct lima_ctx_mgr *mgr, > >> u32 id); > >> void lima_ctx_put(struct lima_ctx *ctx); > >> void lima_ctx_mgr_init(struct lima_ctx_mgr *mgr); > >> void lima_ctx_mgr_fini(struct lima_ctx_mgr *mgr); > >> +long lima_ctx_mgr_flush(struct lima_ctx_mgr *mgr, long timeout); > >> > >> #endif > >> diff --git a/drivers/gpu/drm/lima/lima_drv.c > >> b/drivers/gpu/drm/lima/lima_drv.c > >> index 11ace5cebf4c..08169b0d9c28 100644 > >> --- a/drivers/gpu/drm/lima/lima_drv.c > >> +++ b/drivers/gpu/drm/lima/lima_drv.c > >> @@ -254,7 +254,22 @@ static const struct drm_ioctl_desc > >> lima_drm_driver_ioctls[] = { > >> DRM_IOCTL_DEF_DRV(LIMA_CTX_FREE, lima_ioctl_ctx_free, > >> DRM_RENDER_ALLOW), > >> }; > >> > >> -DEFINE_DRM_GEM_FOPS(lima_drm_driver_fops); > >> +static int lima_drm_driver_flush(struct file *filp, fl_owner_t id) > >> +{ > >> + struct drm_file *file = filp->private_data; > >> + struct lima_drm_priv *priv = file->driver_priv; > >> + long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY; > >> + > >> + timeout = lima_ctx_mgr_flush(&priv->ctx_mgr, timeout); > >> + > >> + return timeout >= 0 ? 0 : timeout; > >> +} > >> + > >> +static const struct file_operations lima_drm_driver_fops = { > >> + .owner = THIS_MODULE, > >> + .flush = lima_drm_driver_flush, > >> + DRM_GEM_FOPS, > >> +}; > >> > >> /* > >> * Changelog: > >> -- > >> 2.49.0 > >> >