On 25/08/2025 at 15:13, Philipp Stanner wrote:
On Fri, 2025-08-22 at 15:43 +0200, Pierre-Eric Pelloux-Prayer wrote:
Currently, the scheduler score is incremented when a job is pushed to an
entity and when an entity is attached to the scheduler.

It is indeed odd that attaching is treated as equivalent to job
submission.

Can you expand the documentation for drm_sched_init_args a bit so that
it gets clearer what the score is supposed to do?


drm_sched_init_args.score is the feature that allows multiple schedulers to share a score, so I suppose you meant drm_gpu_scheduler.score?

The doc currently says "score to help loadbalancer pick a idle sched", which is a bit vague. It could be modified to become:

  @score: holds the number of yet-to-be-completed jobs pushed to each scheduler.
          It's used when load balancing between different schedulers.

What do you think?
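
To make the proposed semantics concrete: the counter would only move when a job is pushed and when it completes. A toy C model (not kernel code; toy_sched, push_job and job_done are illustrative names):

#include <stdatomic.h>

/* Toy model of the proposed @score semantics: it counts
 * yet-to-be-completed jobs and nothing else. */
struct toy_sched {
	atomic_int score;
};

/* A job was pushed to an entity attached to this scheduler. */
static void push_job(struct toy_sched *sched)
{
	atomic_fetch_add(&sched->score, 1);
}

/* The job's finished fence signaled. */
static void job_done(struct toy_sched *sched)
{
	atomic_fetch_sub(&sched->score, 1);
}

Entity attach/detach deliberately doesn't touch the counter, which is exactly the part this patch removes.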



This leads to bad scheduling decisions when the score value is
largely made up of idle entities.

For instance, in a scenario with 2 schedulers where 10 entities each
submit a single job and then do nothing, each scheduler will probably
end up with a score of 5: the per-job increments go away when the jobs
complete, but the per-entity increments remain as long as the entities
stay attached.
Now, 5 userspace apps exit, so their entities will be dropped.


"entities will be dropped" == "drm_sched_entity_kill() gets called",
right?

Yes.

In
the worst case, these apps' entities where all attached to the same

s/where/were

or better yet: "could be"

Will fix, thanks.


scheduler and we end up with score=5 (the 5 remaining entities) and
score=0, despite the 2 schedulers being idle.

Sounds indeed like a (small) problem to me.


When new entities show up, they will all select the second scheduler
based on its low score value, instead of alternating between the 2.
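
To spell out the mechanism: when an entity has several schedulers to choose from, it picks the ready one with the lowest score. A toy model of that selection, loosely following drm_sched_pick_best() (toy_sched and pick_best are illustrative names; the real function differs in details):

#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_sched {
	atomic_int score;	/* pending load on this scheduler */
	int ready;
};

/* Return the ready scheduler with the lowest score. */
static struct toy_sched *pick_best(struct toy_sched **list, unsigned int n)
{
	struct toy_sched *best = NULL;
	unsigned int min_score = UINT_MAX;

	for (unsigned int i = 0; i < n; i++) {
		unsigned int s;

		if (!list[i]->ready)
			continue;

		s = atomic_load(&list[i]->score);
		if (s < min_score) {
			min_score = s;
			best = list[i];
		}
	}
	return best;
}

With the stale scores above (5 vs 0) and both schedulers idle, this always returns the second one.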

Some amdgpu rings depended on this feature, but the previous commit
implemented the same thing in amdgpu directly, so it can be safely
removed from drm/sched.

Can we be that sure that other drivers don't depend on it, though? I
suspect it's likely that it's just amdgpu, but…


Aside from the new "rocket" driver pointed out by Tvrtko, amdgpu is the only driver passing more than one scheduler to entities, so they're the only ones that could be affected.

I verified amdgpu and Tvrtko pinged the rocket maintainers in the other thread.



BTW, since you're cleaning up related stuff currently: I saw that it
seems that the only driver that sets &struct drm_sched_init_args.score
is amdgpu. Would be cool if you can take a look whether that's still
needed.

It cannot really be removed yet, as it's useful when a single hardware block is exposed through multiple schedulers (so pushing jobs to any of those schedulers should increase the reported load of the underlying hw).
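
For instance, two schedulers fronting the same hardware block can point .score at one shared counter, so the load balancer sees their combined load. A sketch (ring names are illustrative and the other init fields are elided; this is not amdgpu's actual setup):

static atomic_t hw_block_score = ATOMIC_INIT(0);

struct drm_sched_init_args ring0_args = {
	/* ... ops, credit_limit, timeout, name, ... */
	.score = &hw_block_score,
};

struct drm_sched_init_args ring1_args = {
	/* ... ops, credit_limit, timeout, name, ... */
	.score = &hw_block_score,
};

When .score is left NULL, drm_sched_init() falls back to the scheduler's own private counter, if I read it correctly.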

Thanks,
Pierre-Eric



P.


Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-pra...@amd.com>
---
  drivers/gpu/drm/scheduler/sched_main.c | 2 --
  1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5a550fd76bf0..e6d232a8ec58 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -206,7 +206,6 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
 	if (!list_empty(&entity->list))
 		return;
 
-	atomic_inc(rq->sched->score);
 	list_add_tail(&entity->list, &rq->entities);
 }
@@ -228,7 +227,6 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 	spin_lock(&rq->lock);
 
-	atomic_dec(rq->sched->score);
 	list_del_init(&entity->list);
 
 	if (rq->current_entity == entity)
