On Tue, Jul 08, 2025 at 04:31:31PM +0100, Tvrtko Ursulin wrote:
>
> On 08/07/2025 14:02, Christian König wrote:
> > On 08.07.25 14:54, Tvrtko Ursulin wrote:
> > >
> > > On 08/07/2025 13:37, Christian König wrote:
> > > > On 08.07.25 11:51, Tvrtko Ursulin wrote:
> > > > > There is no reason to queue just a single job if scheduler can take
> > > > > more and re-queue the worker to queue more.
> > > >
> > > > That's not correct. This was intentionally avoided.
> > > >
> > > > If more than just the scheduler is using the single threaded workqueue
> > > > other workers, especially the timeout worker, can jump in and execute
> > > > first.
> > > >
> > > > We explicitly removed submitting more than one job in each worker run.
> > >
> > > I wanted to ask why, but then I had a look to see if anyone actually does
> > > this. And I did not find any driver sharing a single threaded workqueue
> > > between submit and timeout.
> > >
> > > The only driver which even passes in the same workqueue for both is PVR,
> > > but it is not a single threaded one.
> > >
> > > Or perhaps I misunderstood what you said. Could you please clarify either
> > > way?
> >
> > You correctly understood that.
> >
> > The argument was that submitting more than one job in a worker is simply
> > not beneficial and other work items can jump in and execute.
> >
> > I have no idea if that is actually used or not. You would need to dig up
> > the discussion when we switched from a kernel thread to work items for the
> > full background.
> >

I think Christian is capturing the gist of the discussion. I originally had
it coded the way Tvrtko did, but got pushback and switched to the requeue
approach. If I recall correctly, at the time the default workqueue was a
system WQ, which we definitely didn’t want to hog. Now that the default is a
dedicated worker, this is less of an issue. However, technically, a system
worker could still be passed in, though it shouldn't be, since the WQ should
be marked with WQ_MEM_RECLAIM.

I don’t have a strong opinion either way, so I’m going to stay out of this
one.

Matt

> > But in general to do as little work as possible in each worker and then
> > re-submit it is usually a good idea.
>
> From the point of view that the single work item invocation shouldn't hog
> the worker, if the worker is shared, I agree. But what we also want is to
> feed the GPU as fast as possible, i.e. put the CPU to sleep as quickly as
> possible.
>
> If we consider drivers with dedicated workqueues per hardware engine, or
> even per userspace context, then especially in those cases I don't see what
> is the benefit of playing the wq re-queue games.
>
> Anyway, I can park this patch for now, I *think* it will be easy to drop and
> will just need to rebase 15/15 to cope.
>
> In the meantime I have collected some stats when running Cyberpunk 2077
> benchmark on amdgpu, just to remind myself that it does happen that more
> than one job can be ready to be passed on to the GPU. Stats of number of
> submitted jobs per worker invocation (with this patch):
>
>                 1      2      3      4      5
> gfx_0.0.0   21315    541   9849    171      0
> comp_1.3.0   3093      9      2      0      0
> comp_1.1.0   3501     46      2      1      0
> comp_1.0.1   3451     46      2      0      0
> sdma0        4400    746    279    481      7
>
> This is for userspace contexts only. Quite a good number of three jobs
> submitted per worker invocation.
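
To put rough numbers on the table above, here is a minimal sketch that turns one row of the histogram into totals. The struct and function names are made up for illustration and are not part of the DRM scheduler. It also assumes that a one-job-per-run worker needs at least one worker run per job, which ignores the final run that finds nothing to do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of the DRM scheduler. */
struct batch_stats {
	unsigned long invocations;	/* worker runs that submitted >= 1 job */
	unsigned long jobs;		/* jobs submitted across those runs */
};

/*
 * counts[i] is the number of worker invocations that submitted (i + 1) jobs,
 * i.e. one row of the table quoted above.
 */
struct batch_stats batch_stats_from_histogram(const unsigned long *counts,
					      size_t n)
{
	struct batch_stats st = { 0, 0 };
	size_t i;

	assert(counts != NULL);

	for (i = 0; i < n; i++) {
		st.invocations += counts[i];
		st.jobs += counts[i] * (i + 1);
	}

	return st;
}

/* Worker runs avoided, assuming one run per job in the requeue design. */
unsigned long batch_stats_runs_saved(struct batch_stats st)
{
	return st.jobs - st.invocations;
}
```

Feeding it the gfx_0.0.0 row (21315, 541, 9849, 171, 0) gives 52628 jobs over 31876 worker runs, roughly 1.65 jobs per run, or 20752 fewer runs than one run per job. That counts invocations only and says nothing about latency or power, which would still need measuring.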
>
> Kernel sdma appears to favour deeper queues even more but I forgot to log
> above 2 jobs per worker invocation:
>
>            1      2
> sdma0   8009   1913
>
> I can try to measure the latencies of worker re-queue approach. Another
> interesting thing would be C-state residencies and CPU power. But given how
> when the scheduler went from kthread to wq and lost the ability to queue
> more than one job, I don't think back then anyone measured this? In which
> case I suspect we even don't know if some latency or efficiency was lost.
>
> Regards,
>
> Tvrtko
>
> > > > > We can simply feed the hardware
> > > > > with as much as it can take in one go and hopefully win some latency.
> > > > >
> > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursu...@igalia.com>
> > > > > Cc: Christian König <christian.koe...@amd.com>
> > > > > Cc: Danilo Krummrich <d...@kernel.org>
> > > > > Cc: Matthew Brost <matthew.br...@intel.com>
> > > > > Cc: Philipp Stanner <pha...@kernel.org>
> > > > > ---
> > > > >  drivers/gpu/drm/scheduler/sched_internal.h |   2 -
> > > > >  drivers/gpu/drm/scheduler/sched_main.c     | 132 ++++++++++-----------
> > > > >  drivers/gpu/drm/scheduler/sched_rq.c       |  12 +-
> > > > >  3 files changed, 64 insertions(+), 82 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> > > > > index 15d78abc48df..1a5c2f255223 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_internal.h
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> > > > > @@ -22,8 +22,6 @@ struct drm_sched_entity_stats {
> > > > >  	u64 vruntime;
> > > > >  };
> > > > > -bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> > > > > -			 struct drm_sched_entity *entity);
> > > > >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> > > > >  void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index 35025edea669..1fb3f1da4821 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -95,35 +95,6 @@ static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
> > > > >  	return credits;
> > > > >  }
> > > > > -/**
> > > > > - * drm_sched_can_queue -- Can we queue more to the hardware?
> > > > > - * @sched: scheduler instance
> > > > > - * @entity: the scheduler entity
> > > > > - *
> > > > > - * Return true if we can push at least one more job from @entity, false
> > > > > - * otherwise.
> > > > > - */
> > > > > -bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> > > > > -			 struct drm_sched_entity *entity)
> > > > > -{
> > > > > -	struct drm_sched_job *s_job;
> > > > > -
> > > > > -	s_job = drm_sched_entity_queue_peek(entity);
> > > > > -	if (!s_job)
> > > > > -		return false;
> > > > > -
> > > > > -	/* If a job exceeds the credit limit, truncate it to the credit limit
> > > > > -	 * itself to guarantee forward progress.
> > > > > -	 */
> > > > > -	if (s_job->credits > sched->credit_limit) {
> > > > > -		dev_WARN(sched->dev,
> > > > > -			 "Jobs may not exceed the credit limit, truncate.\n");
> > > > > -		s_job->credits = sched->credit_limit;
> > > > > -	}
> > > > > -
> > > > > -	return drm_sched_available_credits(sched) >= s_job->credits;
> > > > > -}
> > > > > -
> > > > >  /**
> > > > >   * drm_sched_run_job_queue - enqueue run-job work
> > > > >   * @sched: scheduler instance
> > > > > @@ -940,54 +911,77 @@ static void drm_sched_run_job_work(struct work_struct *w)
> > > > >  {
> > > > >  	struct drm_gpu_scheduler *sched =
> > > > >  		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > +	u32 job_credits, submitted_credits = 0;
> > > > >  	struct drm_sched_entity *entity;
> > > > > -	struct dma_fence *fence;
> > > > >  	struct drm_sched_fence *s_fence;
> > > > >  	struct drm_sched_job *sched_job;
> > > > > -	int r;
> > > > > +	struct dma_fence *fence;
> > > > > -	/* Find entity with a ready job */
> > > > > -	entity = drm_sched_rq_select_entity(sched, sched->rq);
> > > > > -	if (IS_ERR_OR_NULL(entity))
> > > > > -		return;	/* No more work */
> > > > > +	while (!READ_ONCE(sched->pause_submit)) {
> > > > > +		/* Find entity with a ready job */
> > > > > +		entity = drm_sched_rq_select_entity(sched, sched->rq);
> > > > > +		if (!entity)
> > > > > +			break;	/* No more work */
> > > > > +
> > > > > +		sched_job = drm_sched_entity_queue_peek(entity);
> > > > > +		if (!sched_job) {
> > > > > +			complete_all(&entity->entity_idle);
> > > > > +			continue;
> > > > > +		}
> > > > > +
> > > > > +		job_credits = sched_job->credits;
> > > > > +		/*
> > > > > +		 * If a job exceeds the credit limit truncate it to guarantee
> > > > > +		 * forward progress.
> > > > > +		 */
> > > > > +		if (dev_WARN_ONCE(sched->dev, job_credits > sched->credit_limit,
> > > > > +				  "Jobs may not exceed the credit limit, truncating.\n"))
> > > > > +			job_credits = sched_job->credits = sched->credit_limit;
> > > > > +
> > > > > +		if (job_credits > drm_sched_available_credits(sched)) {
> > > > > +			complete_all(&entity->entity_idle);
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +		sched_job = drm_sched_entity_pop_job(entity);
> > > > > +		if (!sched_job) {
> > > > > +			/* Top entity is not yet runnable after all */
> > > > > +			complete_all(&entity->entity_idle);
> > > > > +			continue;
> > > > > +		}
> > > > > +
> > > > > +		s_fence = sched_job->s_fence;
> > > > > +		drm_sched_job_begin(sched_job);
> > > > > +		trace_drm_sched_job_run(sched_job, entity);
> > > > > +		submitted_credits += job_credits;
> > > > > +		atomic_add(job_credits, &sched->credit_count);
> > > > > +
> > > > > +		fence = sched->ops->run_job(sched_job);
> > > > > +		drm_sched_fence_scheduled(s_fence, fence);
> > > > > +
> > > > > +		if (!IS_ERR_OR_NULL(fence)) {
> > > > > +			int r;
> > > > > +
> > > > > +			/* Drop for original kref_init of the fence */
> > > > > +			dma_fence_put(fence);
> > > > > +
> > > > > +			r = dma_fence_add_callback(fence, &sched_job->cb,
> > > > > +						   drm_sched_job_done_cb);
> > > > > +			if (r == -ENOENT)
> > > > > +				drm_sched_job_done(sched_job, fence->error);
> > > > > +			else if (r)
> > > > > +				DRM_DEV_ERROR(sched->dev,
> > > > > +					      "fence add callback failed (%d)\n", r);
> > > > > +		} else {
> > > > > +			drm_sched_job_done(sched_job, IS_ERR(fence) ?
> > > > > +					   PTR_ERR(fence) : 0);
> > > > > +		}
> > > > > -	sched_job = drm_sched_entity_pop_job(entity);
> > > > > -	if (!sched_job) {
> > > > >  		complete_all(&entity->entity_idle);
> > > > > -		drm_sched_run_job_queue(sched);
> > > > > -		return;
> > > > >  	}
> > > > > -	s_fence = sched_job->s_fence;
> > > > > -
> > > > > -	atomic_add(sched_job->credits, &sched->credit_count);
> > > > > -	drm_sched_job_begin(sched_job);
> > > > > -
> > > > > -	trace_drm_sched_job_run(sched_job, entity);
> > > > > -	/*
> > > > > -	 * The run_job() callback must by definition return a fence whose
> > > > > -	 * refcount has been incremented for the scheduler already.
> > > > > -	 */
> > > > > -	fence = sched->ops->run_job(sched_job);
> > > > > -	complete_all(&entity->entity_idle);
> > > > > -	drm_sched_fence_scheduled(s_fence, fence);
> > > > > -
> > > > > -	if (!IS_ERR_OR_NULL(fence)) {
> > > > > -		r = dma_fence_add_callback(fence, &sched_job->cb,
> > > > > -					   drm_sched_job_done_cb);
> > > > > -		if (r == -ENOENT)
> > > > > -			drm_sched_job_done(sched_job, fence->error);
> > > > > -		else if (r)
> > > > > -			DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> > > > > -
> > > > > -		dma_fence_put(fence);
> > > > > -	} else {
> > > > > -		drm_sched_job_done(sched_job, IS_ERR(fence) ?
> > > > > -				   PTR_ERR(fence) : 0);
> > > > > -	}
> > > > > -
> > > > > -	wake_up(&sched->job_scheduled);
> > > > > -	drm_sched_run_job_queue(sched);
> > > > > +	if (submitted_credits)
> > > > > +		wake_up(&sched->job_scheduled);
> > > > >  }
> > > > >  static struct workqueue_struct *drm_sched_alloc_wq(const char *name)
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> > > > > index e22f9ff88822..f0afdc0bd417 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_rq.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> > > > > @@ -197,9 +197,7 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
> > > > >   *
> > > > >   * Find oldest waiting ready entity.
> > > > >   *
> > > > > - * Return an entity if one is found; return an error-pointer (!NULL) if an
> > > > > - * entity was ready, but the scheduler had insufficient credits to accommodate
> > > > > - * its job; return NULL, if no ready entity was found.
> > > > > + * Return an entity if one is found or NULL if no ready entity was found.
> > > > >   */
> > > > >  struct drm_sched_entity *
> > > > >  drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> > > > > @@ -213,14 +211,6 @@ drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> > > > >  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > >  		if (drm_sched_entity_is_ready(entity)) {
> > > > > -			/* If we can't queue yet, preserve the current entity in
> > > > > -			 * terms of fairness.
> > > > > -			 */
> > > > > -			if (!drm_sched_can_queue(sched, entity)) {
> > > > > -				spin_unlock(&rq->lock);
> > > > > -				return ERR_PTR(-ENOSPC);
> > > > > -			}
> > > > > -
> > > > >  			reinit_completion(&entity->entity_idle);
> > > > >  			break;
> > > > >  		}
> > > >
> > >
> >
>
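
For anyone who wants to reason about the two designs without reading the whole diff, here is a small userspace toy model. It is only a sketch under simplifying assumptions: all `toy_*` names are hypothetical and not scheduler API, there is a single FIFO of jobs rather than entities in an rb-tree, and no job ever completes, so credits are never returned. It keeps just the two behaviours under discussion: a job larger than the credit limit is truncated to the limit, and a worker run either submits one job and requeues itself or loops until the credits run out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the scheduler's credit accounting (hypothetical). */
struct toy_sched {
	unsigned int credit_limit;	/* credits the hardware ring offers */
	unsigned int credit_count;	/* credits held by in-flight jobs */
};

/* One worker run: submit up to max_jobs jobs, stop when credits run out. */
static size_t toy_worker_run(struct toy_sched *s, const unsigned int *credits,
			     size_t n, size_t max_jobs)
{
	size_t done = 0;

	while (done < n && done < max_jobs) {
		unsigned int c = credits[done];

		/* Truncate oversized jobs to guarantee forward progress. */
		if (c > s->credit_limit)
			c = s->credit_limit;
		/* Not enough credits left: leave the job queued. */
		if (c > s->credit_limit - s->credit_count)
			break;

		s->credit_count += c;
		done++;
	}

	return done;
}

/* Requeue design: one job per run, requeue after every submitted job. */
size_t toy_invocations_requeue(unsigned int credit_limit,
			       const unsigned int *credits, size_t n,
			       size_t *submitted)
{
	struct toy_sched s = { credit_limit, 0 };
	size_t invocations = 0, pos = 0, got;

	assert(submitted != NULL);

	do {
		invocations++;
		got = toy_worker_run(&s, credits + pos, n - pos, 1);
		pos += got;
	} while (got);	/* the last run finds nothing it can submit */

	*submitted = pos;
	return invocations;
}

/* Loop design: a single run drains everything that fits. */
size_t toy_invocations_batched(unsigned int credit_limit,
			       const unsigned int *credits, size_t n,
			       size_t *submitted)
{
	struct toy_sched s = { credit_limit, 0 };

	assert(submitted != NULL);

	*submitted = toy_worker_run(&s, credits, n, n);
	return 1;
}
```

With a credit limit of 4 and queued jobs costing 1, 1, 2 and 1 credits, both variants submit the first three jobs and leave the fourth queued. The requeue variant takes four worker runs to get there, the loop variant takes one. The model does not capture what Christian raised, namely other work items on a shared single threaded workqueue getting a turn between runs.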