Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-09 Thread Luben Tuikov
On 2023-11-09 18:41, Danilo Krummrich wrote:
> On 11/9/23 20:24, Danilo Krummrich wrote:
>> On 11/9/23 07:52, Luben Tuikov wrote:
>>> Hi,
>>>
>>> On 2023-11-07 19:41, Danilo Krummrich wrote:
 On 11/7/23 05:10, Luben Tuikov wrote:
> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
> it do just that, schedule the work item for execution.
>
> The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
> to determine if the scheduler has an entity ready in one of its run-queues,
> and in the case of the Round-Robin (RR) scheduling, the function
> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
> which is ready, sets up the run-queue and completion and returns that
> entity. The FIFO scheduling algorithm is unaffected.
>
> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
> in the case of RR scheduling, that would result in drm_sched_select_entity()
> having been called twice, which may result in skipping a ready entity if more
> than one entity is ready. This commit fixes this by eliminating the call to
> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
> in drm_sched_run_job_work().
>
> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
>     Add fixes-tag. (Tvrtko)
>
> Signed-off-by: Luben Tuikov 
> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
> ---
>    drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
>    1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 27843e37d9b769..cd0dc3f81d05f0 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>    }
>    /**
> - * __drm_sched_run_job_queue - enqueue run-job work
> + * drm_sched_run_job_queue - enqueue run-job work
>     * @sched: scheduler instance
>     */
> -static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>    {
>    if (!READ_ONCE(sched->pause_submit))
>    queue_work(sched->submit_wq, &sched->work_run_job);
> @@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>    void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
>    {
>    if (drm_sched_can_queue(sched))
> -    __drm_sched_run_job_queue(sched);
> +    drm_sched_run_job_queue(sched);
>    }
>    /**
> @@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>    }
>    EXPORT_SYMBOL(drm_sched_pick_best);
> -/**
> - * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
> - * @sched: scheduler instance
> - */
> -static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> -{
> -    if (drm_sched_select_entity(sched))

 Hm, now that I rebase my patch to implement dynamic job-flow control I
 recognize that we probably need the peek semantics here. If we do not select
 an entity here, we also do not check whether the corresponding job fits on
 the ring.

 Alternatively, we simply can't do this check in drm_sched_wakeup(). The
 consequence would be that we don't detect that we need to wait for credits
 to free up before the run work is already executing and the run work selects
 an entity.
>>>
>>> So I rebased v5 on top of the latest drm-misc-next, and looked around and
>>> found out that drm_sched_wakeup() is missing drm_sched_entity_is_ready().
>>> It should look like the following,
>>
>> Yeah, but that's just the consequence of re-basing it onto Tvrtko's patch.
>>
>> My point is that by removing drm_sched_select_entity() from
>> drm_sched_run_job_queue() we do not only lose the check whether the selected
>> entity is ready, but also whether we have enough credits to actually run a
>> new job. This can lead to queuing up work that does nothing but calling
>> drm_sched_select_entity() and returning.
> 
> Ok, I see it now.  We don't need to peek, we know the entity at
> drm_sched_wakeup().
> 
> However, the missing drm_sched_entity_is_ready() check should have been
> added already when drm_sched_select_entity() was removed. Gonna send a fix
> for that as well.

Let me do that, since I added it to your patch.
Then you can rebase your credits patch onto mine.

-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc

Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-09 Thread Danilo Krummrich

On 11/9/23 20:24, Danilo Krummrich wrote:

On 11/9/23 07:52, Luben Tuikov wrote:

Hi,

On 2023-11-07 19:41, Danilo Krummrich wrote:

On 11/7/23 05:10, Luben Tuikov wrote:

Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
  Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov 
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
---
   drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
   1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
   }
   /**
- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
    * @sched: scheduler instance
    */
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
   {
   if (!READ_ONCE(sched->pause_submit))
    queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
*sched)
   void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
   {
   if (drm_sched_can_queue(sched))
-    __drm_sched_run_job_queue(sched);
+    drm_sched_run_job_queue(sched);
   }
   /**
@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
**sched_list,
   }
   EXPORT_SYMBOL(drm_sched_pick_best);
-/**
- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-    if (drm_sched_select_entity(sched))


Hm, now that I rebase my patch to implement dynamic job-flow control I
recognize that we probably need the peek semantics here. If we do not select
an entity here, we also do not check whether the corresponding job fits on
the ring.

Alternatively, we simply can't do this check in drm_sched_wakeup(). The
consequence would be that we don't detect that we need to wait for credits
to free up before the run work is already executing and the run work selects
an entity.


So I rebased v5 on top of the latest drm-misc-next, and looked around and
found out that drm_sched_wakeup() is missing drm_sched_entity_is_ready().
It should look like the following,


Yeah, but that's just the consequence of re-basing it onto Tvrtko's patch.

My point is that by removing drm_sched_select_entity() from
drm_sched_run_job_queue() we do not only lose the check whether the selected
entity is ready, but also whether we have enough credits to actually run a
new job. This can lead to queuing up work that does nothing but calling
drm_sched_select_entity() and returning.


Ok, I see it now.  We don't need to peek, we know the entity at
drm_sched_wakeup().

However, the missing drm_sched_entity_is_ready() check should have been
added already when drm_sched_select_entity() was removed. Gonna send a fix
for that as well.

- Danilo



By peeking the entity we could know this *before* scheduling work and hence
avoid some CPU scheduler overhead.

However, since this patch already landed and we can fail the same way if the
selected entity isn't ready, I don't consider this to be a blocker for the
credit patch, hence I will send out a v6.
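
An illustrative sketch of that peek idea (not part of this series; drm_sched_select_entity_peek() is a hypothetical helper, and the two-argument drm_sched_can_queue() is taken from the snippet below):

/* Rough sketch only: peek at the next ready entity without advancing the
 * RR state, so the work item is queued only when it will actually have a
 * job to run that also fits the available credits. */
static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
{
    struct drm_sched_entity *entity;

    entity = drm_sched_select_entity_peek(sched);   /* hypothetical helper */
    if (entity && drm_sched_can_queue(sched, entity) &&
        !READ_ONCE(sched->pause_submit))
        queue_work(sched->submit_wq, &sched->work_run_job);
}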



void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
                      struct drm_sched_entity *entity)
{
    if (drm_sched_entity_is_ready(entity))
        if (drm_sched_can_queue(sched, entity))
            drm_sched_run_job_queue(sched);
}

See the attached patch. (Currently running with base-commit and the attached
patch.)




Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-09 Thread Danilo Krummrich

On 11/9/23 07:52, Luben Tuikov wrote:

Hi,

On 2023-11-07 19:41, Danilo Krummrich wrote:

On 11/7/23 05:10, Luben Tuikov wrote:

Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
  Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov 
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
---
   drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
   1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
   }
   
   /**

- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
* @sched: scheduler instance
*/
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
   {
if (!READ_ONCE(sched->pause_submit))
    queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
*sched)
   void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
   {
if (drm_sched_can_queue(sched))
-   __drm_sched_run_job_queue(sched);
+   drm_sched_run_job_queue(sched);
   }
   
   /**

@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
**sched_list,
   }
   EXPORT_SYMBOL(drm_sched_pick_best);
   
-/**

- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-   if (drm_sched_select_entity(sched))


Hm, now that I rebase my patch to implement dynamic job-flow control I
recognize that we probably need the peek semantics here. If we do not select
an entity here, we also do not check whether the corresponding job fits on
the ring.

Alternatively, we simply can't do this check in drm_sched_wakeup(). The
consequence would be that we don't detect that we need to wait for credits
to free up before the run work is already executing and the run work selects
an entity.


So I rebased v5 on top of the latest drm-misc-next, and looked around and
found out that drm_sched_wakeup() is missing drm_sched_entity_is_ready().
It should look like the following,


Yeah, but that's just the consequence of re-basing it onto Tvrtko's patch.

My point is that by removing drm_sched_select_entity() from
drm_sched_run_job_queue() we do not only lose the check whether the selected
entity is ready, but also whether we have enough credits to actually run a
new job. This can lead to queuing up work that does nothing but calling
drm_sched_select_entity() and returning.

By peeking the entity we could know this *before* scheduling work and hence
avoid some CPU scheduler overhead.

However, since this patch already landed and we can fail the same way if the
selected entity isn't ready, I don't consider this to be a blocker for the
credit patch, hence I will send out a v6.



void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
                      struct drm_sched_entity *entity)
{
    if (drm_sched_entity_is_ready(entity))
        if (drm_sched_can_queue(sched, entity))
            drm_sched_run_job_queue(sched);
}

See the attached patch. (Currently running with base-commit and the attached
patch.)




Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-08 Thread Luben Tuikov
Hi,

On 2023-11-07 19:41, Danilo Krummrich wrote:
> On 11/7/23 05:10, Luben Tuikov wrote:
>> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
>> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
>> it do just that, schedule the work item for execution.
>>
>> The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
>> to determine if the scheduler has an entity ready in one of its run-queues,
>> and in the case of the Round-Robin (RR) scheduling, the function
>> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
>> which is ready, sets up the run-queue and completion and returns that
>> entity. The FIFO scheduling algorithm is unaffected.
>>
>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
>> in the case of RR scheduling, that would result in drm_sched_select_entity()
>> having been called twice, which may result in skipping a ready entity if more
>> than one entity is ready. This commit fixes this by eliminating the call to
>> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
>> in drm_sched_run_job_work().
>>
>> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
>>  Add fixes-tag. (Tvrtko)
>>
>> Signed-off-by: Luben Tuikov 
>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
>>   1 file changed, 3 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 27843e37d9b769..cd0dc3f81d05f0 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq 
>> *rq)
>>   }
>>   
>>   /**
>> - * __drm_sched_run_job_queue - enqueue run-job work
>> + * drm_sched_run_job_queue - enqueue run-job work
>>* @sched: scheduler instance
>>*/
>> -static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>   {
>>  if (!READ_ONCE(sched->pause_submit))
>>  queue_work(sched->submit_wq, &sched->work_run_job);
>> @@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
>> *sched)
>>   void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
>>   {
>>  if (drm_sched_can_queue(sched))
>> -__drm_sched_run_job_queue(sched);
>> +drm_sched_run_job_queue(sched);
>>   }
>>   
>>   /**
>> @@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
>> **sched_list,
>>   }
>>   EXPORT_SYMBOL(drm_sched_pick_best);
>>   
>> -/**
>> - * drm_sched_run_job_queue - enqueue run-job work if there are ready 
>> entities
>> - * @sched: scheduler instance
>> - */
>> -static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>> -{
>> -if (drm_sched_select_entity(sched))
> 
> Hm, now that I rebase my patch to implement dynamic job-flow control I
> recognize that we probably need the peek semantics here. If we do not select
> an entity here, we also do not check whether the corresponding job fits on
> the ring.
> 
> Alternatively, we simply can't do this check in drm_sched_wakeup(). The
> consequence would be that we don't detect that we need to wait for credits
> to free up before the run work is already executing and the run work selects
> an entity.

So I rebased v5 on top of the latest drm-misc-next, and looked around and
found out that drm_sched_wakeup() is missing drm_sched_entity_is_ready().
It should look like the following,

void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
                      struct drm_sched_entity *entity)
{
    if (drm_sched_entity_is_ready(entity))
        if (drm_sched_can_queue(sched, entity))
            drm_sched_run_job_queue(sched);
}

See the attached patch. (Currently running with base-commit and the attached
patch.)
-- 
Regards,
Luben
From 65b8b8be52e8c112d7350397cb54b4fb3470b008 Mon Sep 17 00:00:00 2001
From: Danilo Krummrich 
Date: Thu, 2 Nov 2023 01:10:34 +0100
Subject: [PATCH] drm/sched: implement dynamic job-flow control

Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs which can be sent to the
hardware.

This implies that for each job, drivers need to account for the maximum
job size possible in order to not overflow the ring buffer.

However, there are drivers, such as Nouveau, where the job size has a
rather large range. For such drivers it can easily happen that job
submissions not even filling the ring by 1% can block subsequent
submissions, which, in the worst case, can lead to the ring running dry.

In order to overcome this issue, allow for tracking the actual job size
instead of the number of jobs in flight.
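
A minimal sketch of what such a credit check could look like (illustrative only, not the posted patch; the credits, credit_limit and credit_count fields are assumptions of this sketch):

/* Only queue run-job work when the entity's next job still fits within the
 * scheduler's remaining credits, instead of merely counting jobs in flight. */
static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
                                struct drm_sched_entity *entity)
{
    struct drm_sched_job *s_job;

    s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
    if (!s_job)
        return false;

    /* credit_limit - credit_count is what is still free on the ring */
    return s_job->credits <= sched->credit_limit -
                             atomic_read(&sched->credit_count);
}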

Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-07 Thread Luben Tuikov
On 2023-11-07 12:53, Danilo Krummrich wrote:
> On 11/7/23 05:10, Luben Tuikov wrote:
>> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
>> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
>> it do just that, schedule the work item for execution.
>>
>> The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
>> to determine if the scheduler has an entity ready in one of its run-queues,
>> and in the case of the Round-Robin (RR) scheduling, the function
>> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
>> which is ready, sets up the run-queue and completion and returns that
>> entity. The FIFO scheduling algorithm is unaffected.
>>
>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
>> in the case of RR scheduling, that would result in drm_sched_select_entity()
>> having been called twice, which may result in skipping a ready entity if more
>> than one entity is ready. This commit fixes this by eliminating the call to
>> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
>> in drm_sched_run_job_work().
>>
>> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
>>  Add fixes-tag. (Tvrtko)
>>
>> Signed-off-by: Luben Tuikov 
>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
> 
> Reviewed-by: Danilo Krummrich 

Thank you, sir!
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-07 Thread Luben Tuikov
On 2023-11-07 06:48, Matthew Brost wrote:
> On Mon, Nov 06, 2023 at 11:10:21PM -0500, Luben Tuikov wrote:
>> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
>> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
>> it do just that, schedule the work item for execution.
>>
>> The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
>> to determine if the scheduler has an entity ready in one of its run-queues,
>> and in the case of the Round-Robin (RR) scheduling, the function
>> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
>> which is ready, sets up the run-queue and completion and returns that
>> entity. The FIFO scheduling algorithm is unaffected.
>>
>> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
>> in the case of RR scheduling, that would result in drm_sched_select_entity()
>> having been called twice, which may result in skipping a ready entity if more
>> than one entity is ready. This commit fixes this by eliminating the call to
>> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
>> in drm_sched_run_job_work().
>>
>> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
>> Add fixes-tag. (Tvrtko)
>>
>> Signed-off-by: Luben Tuikov 
>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
> 
> Reviewed-by: Matthew Brost 

Thank you, sir!
-- 
Regards,
Luben


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-07 Thread Danilo Krummrich

On 11/7/23 05:10, Luben Tuikov wrote:

Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
 Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov 
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
---
  drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
  1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
  }
  
  /**

- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
   * @sched: scheduler instance
   */
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
  {
if (!READ_ONCE(sched->pause_submit))
    queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
*sched)
  void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
  {
if (drm_sched_can_queue(sched))
-   __drm_sched_run_job_queue(sched);
+   drm_sched_run_job_queue(sched);
  }
  
  /**

@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
**sched_list,
  }
  EXPORT_SYMBOL(drm_sched_pick_best);
  
-/**

- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-   if (drm_sched_select_entity(sched))


Hm, now that I rebase my patch to implement dynamic job-flow control I
recognize that we probably need the peek semantics here. If we do not select
an entity here, we also do not check whether the corresponding job fits on
the ring.

Alternatively, we simply can't do this check in drm_sched_wakeup(). The
consequence would be that we don't detect that we need to wait for credits
to free up before the run work is already executing and the run work selects
an entity.

- Danilo


-   __drm_sched_run_job_queue(sched);
-}
-
  /**
   * drm_sched_free_job_work - worker to call free_job
   *

base-commit: 27d9620e9a9a6bc27a646b464b85860d91e21af3




Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-07 Thread Matthew Brost
On Mon, Nov 06, 2023 at 11:10:21PM -0500, Luben Tuikov wrote:
> Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
> it do just that, schedule the work item for execution.
> 
> The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
> to determine if the scheduler has an entity ready in one of its run-queues,
> and in the case of the Round-Robin (RR) scheduling, the function
> drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
> which is ready, sets up the run-queue and completion and returns that
> entity. The FIFO scheduling algorithm is unaffected.
> 
> Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
> in the case of RR scheduling, that would result in drm_sched_select_entity()
> having been called twice, which may result in skipping a ready entity if more
> than one entity is ready. This commit fixes this by eliminating the call to
> drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
> in drm_sched_run_job_work().
> 
> v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
> Add fixes-tag. (Tvrtko)
> 
> Signed-off-by: Luben Tuikov 
> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")

Reviewed-by: Matthew Brost 

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
>  1 file changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 27843e37d9b769..cd0dc3f81d05f0 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>  }
>  
>  /**
> - * __drm_sched_run_job_queue - enqueue run-job work
> + * drm_sched_run_job_queue - enqueue run-job work
>   * @sched: scheduler instance
>   */
> -static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>  {
>   if (!READ_ONCE(sched->pause_submit))
>   queue_work(sched->submit_wq, &sched->work_run_job);
> @@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
> *sched)
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
>  {
>   if (drm_sched_can_queue(sched))
> - __drm_sched_run_job_queue(sched);
> + drm_sched_run_job_queue(sched);
>  }
>  
>  /**
> @@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
> **sched_list,
>  }
>  EXPORT_SYMBOL(drm_sched_pick_best);
>  
> -/**
> - * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
> - * @sched: scheduler instance
> - */
> -static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> -{
> - if (drm_sched_select_entity(sched))
> - __drm_sched_run_job_queue(sched);
> -}
> -
>  /**
>   * drm_sched_free_job_work - worker to call free_job
>   *
> 
> base-commit: 27d9620e9a9a6bc27a646b464b85860d91e21af3
> -- 
> 2.42.1
> 


Re: [PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-07 Thread Danilo Krummrich

On 11/7/23 05:10, Luben Tuikov wrote:

Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
 Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov 
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")


Reviewed-by: Danilo Krummrich 


---
  drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
  1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
  }
  
  /**

- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
   * @sched: scheduler instance
   */
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
  {
if (!READ_ONCE(sched->pause_submit))
    queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
*sched)
  void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
  {
if (drm_sched_can_queue(sched))
-   __drm_sched_run_job_queue(sched);
+   drm_sched_run_job_queue(sched);
  }
  
  /**

@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
**sched_list,
  }
  EXPORT_SYMBOL(drm_sched_pick_best);
  
-/**

- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-   if (drm_sched_select_entity(sched))
-   __drm_sched_run_job_queue(sched);
-}
-
  /**
   * drm_sched_free_job_work - worker to call free_job
   *

base-commit: 27d9620e9a9a6bc27a646b464b85860d91e21af3




[PATCH] drm/sched: Don't disturb the entity when in RR-mode scheduling

2023-11-06 Thread Luben Tuikov
Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov 
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
---
 drivers/gpu/drm/scheduler/sched_main.c | 16 +++-
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 }
 
 /**
- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
  * @sched: scheduler instance
  */
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
 {
if (!READ_ONCE(sched->pause_submit))
    queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler 
*sched)
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
 {
if (drm_sched_can_queue(sched))
-   __drm_sched_run_job_queue(sched);
+   drm_sched_run_job_queue(sched);
 }
 
 /**
@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler 
**sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-   if (drm_sched_select_entity(sched))
-   __drm_sched_run_job_queue(sched);
-}
-
 /**
  * drm_sched_free_job_work - worker to call free_job
  *

base-commit: 27d9620e9a9a6bc27a646b464b85860d91e21af3
-- 
2.42.1
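
For readers skimming the archive, here is a stand-alone toy (plain C, not scheduler code) showing why running the round-robin selection twice per dispatch, once when deciding to queue the work item and once inside the work item, skips a ready entity:

/*
 * Toy illustration of the bug the patch fixes. select_entity_rr() stands in
 * for drm_sched_rq_select_entity_rr(), which advances rq->current_entity as
 * a side effect of every call.
 */
#include <stdio.h>

static const char *entities[] = { "A", "B", "C" }; /* all ready */
static int cursor = -1;                            /* RR state */

static const char *select_entity_rr(void)
{
    cursor = (cursor + 1) % 3; /* advances the RR cursor on every call */
    return entities[cursor];
}

int main(void)
{
    /* Buggy flow: select once to decide whether to queue the work item... */
    const char *checked = select_entity_rr();
    /* ...and select again inside the work item that actually runs a job. */
    const char *ran = select_entity_rr();

    printf("checked %s but ran a job from %s; %s was skipped this round\n",
           checked, ran, checked);
    return 0;
}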