On Fri, 22 May 2026 17:25:18 +0100
Tvrtko Ursulin <[email protected]> wrote:

> On 22/05/2026 17:08, Boris Brezillon wrote:
> > On Fri, 22 May 2026 12:38:17 +0100
> > Tvrtko Ursulin <[email protected]> wrote:
> >   
> >> Currently an unordered workqueue is used for the DRM scheduler which means
> >> its concurrency is externally managed, and given there is one scheduler
> >> instance per userspace queue, that means workqueue management logic is
> >> within its rights to spawn many kernel threads to submit their respective
> >> jobs.
> >>
> >> Problem there is that all run job callbacks are serialized on the device
> >> global mutex, making the potential thread storm just causing lock
> >> contention.  
> > 
> > Yeah, so initially we were not supposed to take the lock over the whole
> > run_job() function. We should normally be able to queue things to the
> > ring buffer, and only briefly take the lock to check if the context is
> > still resident and kick the group scheduler if it's not. I agree that
> > in practice it turned in a huge synchronization point. I guess we should
> > consider turning that mutex into a rwsem that's taken in write mode in
> > the tick path, and read-mode elsewhere.  
> 
> There is some software state modified too, so I am not sure how easy or 
> hard it would be to make run job only hold the read lock?

Yeah, it's probably not as easy as it sounds.

> 
> >> If we add a separate ordered workqueue for the DRM scheduler integration
> >> we can avoid this problem, since the ordered property directly expresses
> >> the nature of the submission backend implementation.  
> > 
> > I don't see alloc_ordered_workqueue() being used for the sched_wq
> > workqueue, is that intended (according to this comment, you seem to
> > want an ordered wq).  
> 
> Yeah, it seems I mistyped the wq allocation below.
> 
> >> And considering the other user of this workqueue, the free job callback,
> >> which is not globally serialized in this manner so could be thought to
> >> potentially regress with this change, it should not be the case since
> >> commit
> >> a58f317c1ca0 ("drm/sched: Free all finished jobs at once")
> >> made the DRM scheduler handle the cleanup of finished jobs more promptly.
> >>
> >> Signed-off-by: Tvrtko Ursulin <[email protected]>
> >> Cc: Boris Brezillon <[email protected]>
> >> Cc: Liviu Dudau <[email protected]>
> >> Cc: Steven Price <[email protected]>
> >> ---
> >>   drivers/gpu/drm/panthor/panthor_sched.c | 27 ++++++++++++++++---------
> >>   1 file changed, 17 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c 
> >> b/drivers/gpu/drm/panthor/panthor_sched.c
> >> index 2bee1c92fb9e..cc6b3e2b015a 100644
> >> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> >> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> >> @@ -147,13 +147,11 @@ struct panthor_scheduler {
> >>    struct panthor_device *ptdev;
> >>   
> >>    /**
> >> -   * @wq: Workqueue used by our internal scheduler logic and
> >> -   * drm_gpu_scheduler.
> >> +   * @wq: Workqueue used by our internal scheduler logic.
> >>     *
> >>     * Used for the scheduler tick, group update or other kind of FW
> >>     * event processing that can't be handled in the threaded interrupt
> >> -   * path. Also passed to the drm_gpu_scheduler instances embedded
> >> -   * in panthor_queue.
> >> +   * path.
> >>     */
> >>    struct workqueue_struct *wq;
> >>   
> >> @@ -166,6 +164,14 @@ struct panthor_scheduler {
> >>     */
> >>    struct workqueue_struct *heap_alloc_wq;
> >>   
> >> +  /**
> >> +   * @sched_wq: Workqueue used for the DRM scheduler.
> >> +   *
> >> +   * Workqueue used for drm_gpu_scheduler instances embedded in
> >> +   * panthor_queue.
> >> +   */
> >> +  struct workqueue_struct *sched_wq;
> >> +
> >>    /** @tick_work: Work executed on a scheduling tick. */
> >>    struct delayed_work tick_work;
> >>   
> >> @@ -3488,7 +3494,7 @@ group_create_queue(struct panthor_group *group,
> >>   {
> >>    struct drm_sched_init_args sched_args = {
> >>            .ops = &panthor_queue_sched_ops,
> >> -          .submit_wq = group->ptdev->scheduler->wq,
> >> +          .submit_wq = group->ptdev->scheduler->sched_wq,
> >>            /*
> >>             * The credit limit argument tells us the total number of
> >>             * instructions across all CS slots in the ringbuffer, with
> >> @@ -4078,6 +4084,9 @@ static void panthor_sched_fini(struct drm_device 
> >> *ddev, void *res)
> >>    if (sched->heap_alloc_wq)
> >>            destroy_workqueue(sched->heap_alloc_wq);
> >>   
> >> +  if (sched->sched_wq)
> >> +          destroy_workqueue(sched->sched_wq);
> >> +
> >>    for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
> >>            drm_WARN_ON(ddev, !list_empty(&sched->groups.runnable[prio]));
> >>            drm_WARN_ON(ddev, !list_empty(&sched->groups.idle[prio]));
> >> @@ -4167,13 +4176,11 @@ int panthor_sched_init(struct panthor_device 
> >> *ptdev)
> >>     * FW is smart enough to fall back on other methods if the kernel can't
> >>     * allocate memory, and fail the tiling job if none of these
> >>     * countermeasures worked.
> >> -   *
> >> -   * Set WQ_MEM_RECLAIM on sched->wq to unblock the situation when the
> >> -   * system is running out of memory.
> >>     */
> >>    sched->heap_alloc_wq = alloc_workqueue("panthor-heap-alloc", 
> >> WQ_UNBOUND, 0);
> >> -  sched->wq = alloc_workqueue("panthor-csf-sched", WQ_MEM_RECLAIM | 
> >> WQ_UNBOUND, 0);
> >> -  if (!sched->wq || !sched->heap_alloc_wq) {
> >> +  sched->wq = alloc_workqueue("panthor-csf-sched", WQ_UNBOUND, 0);
> >> +  sched->sched_wq = alloc_workqueue("panthor-drm-sched", WQ_MEM_RECLAIM, 
> >> 0);  
> > 
> > The other one also needs MEM_RECLAIM, because you need work items
> > queued to sched->wq to run to guarantee forward progress, and you need
> > to guarantee forward progress to reclaim GPU mem.  
> 
> Ack, I had a feeling that might be the case.
> 
> I will respin next week or so. Or if you tell me the global lock can be 
> easily dropped from .run_job I can drop and forget about it. Wider 
> context is that I am experimenting with kthread_worker conversion and 
> trying to polish a 
> somewhat-broken-but-showing-great-latency-improvements branch.

Yep, I know, your experimental branch is actually on my list of things
to look at/test ;-).

> For that 
> I can kind of take either one global worker, or one worker per client 
> route for the RFC, no big deal either way for the prototype.

We probably want a worker per-prio (and possibly per-cpu), but certainly
not one per-client, or you'll end up with the thread explosion that was
addressed by the kthread -> workqueue transition.

Also, I didn't look at your panthor changes in this branch yet, but if
we're switching drm_sched to kthread workers, we probably want to
transition most existing panthor works to kthread_work, because some FW
events might need to processed for the GPU context to be unblocked, and
if we keep queuing those to a regular workqueue, they will be lagging
behing the HI_PRIO thread you have for HI_PRIO contexts.

Reply via email to