On Fri, 22 May 2026 17:25:18 +0100 Tvrtko Ursulin <[email protected]> wrote:
> On 22/05/2026 17:08, Boris Brezillon wrote: > > On Fri, 22 May 2026 12:38:17 +0100 > > Tvrtko Ursulin <[email protected]> wrote: > > > >> Currently an unordered workqueue is used for the DRM scheduler which means > >> its concurrency is externally managed, and given there is one scheduler > >> instance per userspace queue, that means workqueue management logic is > >> within its rights to spawn many kernel threads to submit their respective > >> jobs. > >> > >> Problem there is that all run job callbacks are serialized on the device > >> global mutex, making the potential thread storm just causing lock > >> contention. > > > > Yeah, so initially we were not supposed to take the lock over the whole > > run_job() function. We should normally be able to queue things to the > > ring buffer, and only briefly take the lock to check if the context is > > still resident and kick the group scheduler if it's not. I agree that > > in practice it turned in a huge synchronization point. I guess we should > > consider turning that mutex into a rwsem that's taken in write mode in > > the tick path, and read-mode elsewhere. > > There is some software state modified too, so I am not sure how easy or > hard it would be to make run job only hold the read lock? Yeah, it's probably not as easy as it sounds. > > >> If we add a separate ordered workqueue for the DRM scheduler integration > >> we can avoid this problem, since the ordered property directly expresses > >> the nature of the submission backend implementation. > > > > I don't see alloc_ordered_workqueue() being used for the sched_wq > > workqueue, is that intended (according to this comment, you seem to > > want an ordered wq). > > Yeah, it seems I mistyped the wq allocation below. > > >> And considering the other user of this workqueue, the free job callback, > >> which is not globally serialized in this manner so could be thought to > >> potentially regress with this change, it should not be the case since > >> commit > >> a58f317c1ca0 ("drm/sched: Free all finished jobs at once") > >> made the DRM scheduler handle the cleanup of finished jobs more promptly. > >> > >> Signed-off-by: Tvrtko Ursulin <[email protected]> > >> Cc: Boris Brezillon <[email protected]> > >> Cc: Liviu Dudau <[email protected]> > >> Cc: Steven Price <[email protected]> > >> --- > >> drivers/gpu/drm/panthor/panthor_sched.c | 27 ++++++++++++++++--------- > >> 1 file changed, 17 insertions(+), 10 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c > >> b/drivers/gpu/drm/panthor/panthor_sched.c > >> index 2bee1c92fb9e..cc6b3e2b015a 100644 > >> --- a/drivers/gpu/drm/panthor/panthor_sched.c > >> +++ b/drivers/gpu/drm/panthor/panthor_sched.c > >> @@ -147,13 +147,11 @@ struct panthor_scheduler { > >> struct panthor_device *ptdev; > >> > >> /** > >> - * @wq: Workqueue used by our internal scheduler logic and > >> - * drm_gpu_scheduler. > >> + * @wq: Workqueue used by our internal scheduler logic. > >> * > >> * Used for the scheduler tick, group update or other kind of FW > >> * event processing that can't be handled in the threaded interrupt > >> - * path. Also passed to the drm_gpu_scheduler instances embedded > >> - * in panthor_queue. > >> + * path. > >> */ > >> struct workqueue_struct *wq; > >> > >> @@ -166,6 +164,14 @@ struct panthor_scheduler { > >> */ > >> struct workqueue_struct *heap_alloc_wq; > >> > >> + /** > >> + * @sched_wq: Workqueue used for the DRM scheduler. > >> + * > >> + * Workqueue used for drm_gpu_scheduler instances embedded in > >> + * panthor_queue. > >> + */ > >> + struct workqueue_struct *sched_wq; > >> + > >> /** @tick_work: Work executed on a scheduling tick. */ > >> struct delayed_work tick_work; > >> > >> @@ -3488,7 +3494,7 @@ group_create_queue(struct panthor_group *group, > >> { > >> struct drm_sched_init_args sched_args = { > >> .ops = &panthor_queue_sched_ops, > >> - .submit_wq = group->ptdev->scheduler->wq, > >> + .submit_wq = group->ptdev->scheduler->sched_wq, > >> /* > >> * The credit limit argument tells us the total number of > >> * instructions across all CS slots in the ringbuffer, with > >> @@ -4078,6 +4084,9 @@ static void panthor_sched_fini(struct drm_device > >> *ddev, void *res) > >> if (sched->heap_alloc_wq) > >> destroy_workqueue(sched->heap_alloc_wq); > >> > >> + if (sched->sched_wq) > >> + destroy_workqueue(sched->sched_wq); > >> + > >> for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) { > >> drm_WARN_ON(ddev, !list_empty(&sched->groups.runnable[prio])); > >> drm_WARN_ON(ddev, !list_empty(&sched->groups.idle[prio])); > >> @@ -4167,13 +4176,11 @@ int panthor_sched_init(struct panthor_device > >> *ptdev) > >> * FW is smart enough to fall back on other methods if the kernel can't > >> * allocate memory, and fail the tiling job if none of these > >> * countermeasures worked. > >> - * > >> - * Set WQ_MEM_RECLAIM on sched->wq to unblock the situation when the > >> - * system is running out of memory. > >> */ > >> sched->heap_alloc_wq = alloc_workqueue("panthor-heap-alloc", > >> WQ_UNBOUND, 0); > >> - sched->wq = alloc_workqueue("panthor-csf-sched", WQ_MEM_RECLAIM | > >> WQ_UNBOUND, 0); > >> - if (!sched->wq || !sched->heap_alloc_wq) { > >> + sched->wq = alloc_workqueue("panthor-csf-sched", WQ_UNBOUND, 0); > >> + sched->sched_wq = alloc_workqueue("panthor-drm-sched", WQ_MEM_RECLAIM, > >> 0); > > > > The other one also needs MEM_RECLAIM, because you need work items > > queued to sched->wq to run to guarantee forward progress, and you need > > to guarantee forward progress to reclaim GPU mem. > > Ack, I had a feeling that might be the case. > > I will respin next week or so. Or if you tell me the global lock can be > easily dropped from .run_job I can drop and forget about it. Wider > context is that I am experimenting with kthread_worker conversion and > trying to polish a > somewhat-broken-but-showing-great-latency-improvements branch. Yep, I know, your experimental branch is actually on my list of things to look at/test ;-). > For that > I can kind of take either one global worker, or one worker per client > route for the RFC, no big deal either way for the prototype. We probably want a worker per-prio (and possibly per-cpu), but certainly not one per-client, or you'll end up with the thread explosion that was addressed by the kthread -> workqueue transition. Also, I didn't look at your panthor changes in this branch yet, but if we're switching drm_sched to kthread workers, we probably want to transition most existing panthor works to kthread_work, because some FW events might need to processed for the GPU context to be unblocked, and if we keep queuing those to a regular workqueue, they will be lagging behing the HI_PRIO thread you have for HI_PRIO contexts.
