Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-11-01 Thread Luben Tuikov
On 2023-10-31 22:23, Danilo Krummrich wrote:
> Hi Luben,
> 

[snip]
>>> @@ -187,12 +251,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>  /**
>>>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Try to find a ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>> +			      struct drm_sched_rq *rq)
>>>  {
>>>  	struct drm_sched_entity *entity;
>>>  
>>> @@ -202,6 +268,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  	if (entity) {
>>>  		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>  			if (drm_sched_entity_is_ready(entity)) {
>>> +				/* If we can't queue yet, preserve the current
>>> +				 * entity in terms of fairness.
>>> +				 */
>>> +				if (!drm_sched_can_queue(sched, entity)) {
>>> +					spin_unlock(&rq->lock);
>>> +					return ERR_PTR(-ENOSPC);
>>> +				}
>>> +
>>>  				rq->current_entity = entity;
>>>  				reinit_completion(&entity->entity_idle);
>>>  				spin_unlock(&rq->lock);
>>> @@ -211,8 +285,15 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  	}
>>>  
>>>  	list_for_each_entry(entity, &rq->entities, list) {
>>> -
>>>  		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				spin_unlock(&rq->lock);
>>> +				return ERR_PTR(-ENOSPC);
>>> +			}
>>> +
>>>  			rq->current_entity = entity;
>>>  			reinit_completion(&entity->entity_idle);
>>>  			spin_unlock(&rq->lock);
>>> @@ -231,12 +312,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  /**
>>>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Find oldest waiting ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_rq *rq)
>>>  {
>>>  	struct rb_node *rb;
>>>  
>>> @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>  
>>>  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>  		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				spin_unlock(&rq->lock);
>>> +				return ERR_PTR(-ENOSPC);
>>> +			}
>>> +
>>
>> So, this kinda did this abrupt "return" in v2, then it was fixed fine in v3,
>> and now I see we return an error, which doesn't seem to be used by
>> anyone. In fact, in drm_sched_select_entity(), we do this,
>>
>> -	return entity;
>> +	return IS_ERR(entity) ? NULL : entity;
>>
>> So, it's perhaps best to leave this as it was in v3, and if in the future
>> we need to distinguish between the type of error, then that future patch
>> would do that and also show how this is used with new code and logic.
>>
>> There is little value to over-engineer this right now, given that in
>> the future, the logic may be more esoteric than we can think of.
> 
> Ok, maybe what I do here is a little bit subtle and requires a comment. Let me
> explain.
> 
> The reason I return an ERR_PTR() instead of NULL is to indicate to
> drm_sched_select_entity() to break out of the runqueue loop
> (drm_sched_select_entity() breaks the loop when the returned entity is not
> NULL).
> 
> Without that, we'd continue the runqueue loop in drm_sched_select_entity() and
> retry with the next lower priority. This could lead to priority inversion of
> the runqueues, since a lower priority runqueue with jobs requiring fewer
> credits could stall a higher priority runqueue with jobs requiring more
> credits.
> 
> Hence, in drm_sched_select_entity() we need to be able to distinguish between
> the case where drm_sched_rq_select_entity_fifo()/drm_sched_rq_select_entity_rr()
> can't find an entity and the case where they *can* find an entity,
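The control flow being discussed can be modeled in a few lines of userspace C. This is an illustrative sketch only: `ERR_PTR()`/`IS_ERR()` are simplified stand-ins for the kernel helpers in `<linux/err.h>`, and all structs and function names below are hypothetical, not the scheduler's real ones.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers:
 * pointers in the top 4095 bytes of the address space encode negative
 * errno values. */
#define ERR_PTR(err) ((void *)(intptr_t)(err))
#define IS_ERR(ptr)  ((uintptr_t)(ptr) >= (uintptr_t)-4095)

/* Illustrative types; one entity per run queue, for brevity. */
struct entity { int ready; int credits_needed; };
struct runq   { struct entity *e; };

/* Modeled after drm_sched_rq_select_entity_*(): NULL means "no ready
 * entity in this run queue", ERR_PTR(-ENOSPC) means "a ready entity
 * exists, but its job does not fit the currently available credits". */
static struct entity *rq_select(struct runq *rq, int avail_credits)
{
	if (!rq->e || !rq->e->ready)
		return NULL;
	if (rq->e->credits_needed > avail_credits)
		return ERR_PTR(-ENOSPC);
	return rq->e;
}

/* Modeled after drm_sched_select_entity(): walk run queues from high
 * to low priority, but give up entirely on -ENOSPC instead of falling
 * through to a lower priority -- that fall-through is exactly the
 * priority inversion described above. */
static struct entity *select_entity(struct runq *rqs, int nrq, int credits)
{
	for (int i = 0; i < nrq; i++) {
		struct entity *e = rq_select(&rqs[i], credits);
		if (e)
			return IS_ERR(e) ? NULL : e;
	}
	return NULL;
}
```

Note how the NULL and ERR_PTR() cases both end the search, but only NULL from a run queue lets the loop continue to the next priority.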

Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-10-31 Thread Danilo Krummrich
Hi Luben,

On Tue, Oct 31, 2023 at 08:51:17PM -0400, Luben Tuikov wrote:
> Hi,
> 
> (PSA: luben.tui...@amd.com should've bounced :-) I'm removing it from the To: 
> field.)
> 
> On 2023-10-30 20:26, Danilo Krummrich wrote:
> > [snip: full patch quoted; see the original posting below]

Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-10-31 Thread Luben Tuikov
Hi,

(PSA: luben.tui...@amd.com should've bounced :-) I'm removing it from the To: 
field.)

On 2023-10-30 20:26, Danilo Krummrich wrote:
> [snip: full patch quoted; see the original posting below]

Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-10-31 Thread Danilo Krummrich

On 10/31/23 12:13, Christian König wrote:



On 31.10.23 01:26, Danilo Krummrich wrote:

[snip: full patch quoted; see the original posting below]

Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-10-31 Thread Christian König




On 31.10.23 01:26, Danilo Krummrich wrote:

[snip: full patch quoted; see the original posting below]

 

Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-10-31 Thread kernel test robot
Hi Danilo,

kernel test robot noticed the following build errors:

[auto build test ERROR on b2139fb5051554a7f297e4ad584ef1bc26c76d5d]

url:
https://github.com/intel-lab-lkp/linux/commits/Danilo-Krummrich/drm-sched-implement-dynamic-job-flow-control/20231031-082915
base:   b2139fb5051554a7f297e4ad584ef1bc26c76d5d
patch link:
https://lore.kernel.org/r/20231031002655.38707-1-dakr%40redhat.com
patch subject: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow 
control
config: loongarch-randconfig-002-20231031 
(https://download.01.org/0day-ci/archive/20231031/202310311632.rpehvhmk-...@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231031/202310311632.rpehvhmk-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202310311632.rpehvhmk-...@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/etnaviv/etnaviv_gpu.c: In function 'etnaviv_gpu_rpm_suspend':
>> drivers/gpu/drm/etnaviv/etnaviv_gpu.c:1920:36: error: 'struct drm_gpu_scheduler' has no member named 'hw_rq_count'
    1920 |         if (atomic_read(&gpu->sched.hw_rq_count))
         |                         ^


vim +1920 drivers/gpu/drm/etnaviv/etnaviv_gpu.c

a8c21a5451d831 The etnaviv authors 2015-12-03  1913  
a8c21a5451d831 The etnaviv authors 2015-12-03  1914  static int etnaviv_gpu_rpm_suspend(struct device *dev)
a8c21a5451d831 The etnaviv authors 2015-12-03  1915  {
a8c21a5451d831 The etnaviv authors 2015-12-03  1916  	struct etnaviv_gpu *gpu = dev_get_drvdata(dev);
a8c21a5451d831 The etnaviv authors 2015-12-03  1917  	u32 idle, mask;
a8c21a5451d831 The etnaviv authors 2015-12-03  1918  
f4163814813fb3 Lucas Stach         2018-11-05  1919  	/* If there are any jobs in the HW queue, we're not idle */
f4163814813fb3 Lucas Stach         2018-11-05 @1920  	if (atomic_read(&gpu->sched.hw_rq_count))
a8c21a5451d831 The etnaviv authors 2015-12-03  1921  		return -EBUSY;
a8c21a5451d831 The etnaviv authors 2015-12-03  1922  
1a910c11d35bfa Guido Günther       2020-03-02  1923  	/* Check whether the hardware (except FE and MC) is idle */
1a910c11d35bfa Guido Günther       2020-03-02  1924  	mask = gpu->idle_mask & ~(VIVS_HI_IDLE_STATE_FE |
1a910c11d35bfa Guido Günther       2020-03-02  1925  				  VIVS_HI_IDLE_STATE_MC);
a8c21a5451d831 The etnaviv authors 2015-12-03  1926  	idle = gpu_read(gpu, VIVS_HI_IDLE_STATE) & mask;
78f2bfa3181cd7 Guido Günther       2020-03-02  1927  	if (idle != mask) {
78f2bfa3181cd7 Guido Günther       2020-03-02  1928  		dev_warn_ratelimited(dev, "GPU not yet idle, mask: 0x%08x\n",
78f2bfa3181cd7 Guido Günther       2020-03-02  1929  				     idle);
a8c21a5451d831 The etnaviv authors 2015-12-03  1930  		return -EBUSY;
78f2bfa3181cd7 Guido Günther       2020-03-02  1931  	}
a8c21a5451d831 The etnaviv authors 2015-12-03  1932  
7cb544946a138d Lucas Stach         2023-06-07  1933  	etnaviv_gpu_hw_suspend(gpu);
7cb544946a138d Lucas Stach         2023-06-07  1934  
647d817d807127 Lucas Stach         2023-06-07  1935  	gpu->state = ETNA_GPU_STATE_IDENTIFIED;
647d817d807127 Lucas Stach         2023-06-07  1936  
7cb544946a138d Lucas Stach         2023-06-07  1937  	return etnaviv_gpu_clk_disable(gpu);
a8c21a5451d831 The etnaviv authors 2015-12-03  1938  }
a8c21a5451d831 The etnaviv authors 2015-12-03  1939  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-10-31 Thread kernel test robot
Hi Danilo,

kernel test robot noticed the following build errors:

[auto build test ERROR on b2139fb5051554a7f297e4ad584ef1bc26c76d5d]

url:
https://github.com/intel-lab-lkp/linux/commits/Danilo-Krummrich/drm-sched-implement-dynamic-job-flow-control/20231031-082915
base:   b2139fb5051554a7f297e4ad584ef1bc26c76d5d
patch link:
https://lore.kernel.org/r/20231031002655.38707-1-dakr%40redhat.com
patch subject: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow 
control
config: arc-randconfig-002-20231031 
(https://download.01.org/0day-ci/archive/20231031/202310311534.i8wqwok0-...@intel.com/config)
compiler: arceb-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231031/202310311534.i8wqwok0-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202310311534.i8wqwok0-...@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/panfrost/panfrost_job.c: In function 'panfrost_job_is_idle':
>> drivers/gpu/drm/panfrost/panfrost_job.c:966:52: error: 'struct drm_gpu_scheduler' has no member named 'hw_rq_count'
     966 |                 if (atomic_read(&js->queue[i].sched.hw_rq_count))
         |                                 ^


vim +966 drivers/gpu/drm/panfrost/panfrost_job.c

f3ba91228e8e91 Rob Herring 2018-09-10  958  
f3ba91228e8e91 Rob Herring 2018-09-10  959  int panfrost_job_is_idle(struct panfrost_device *pfdev)
f3ba91228e8e91 Rob Herring 2018-09-10  960  {
f3ba91228e8e91 Rob Herring 2018-09-10  961  	struct panfrost_job_slot *js = pfdev->js;
f3ba91228e8e91 Rob Herring 2018-09-10  962  	int i;
f3ba91228e8e91 Rob Herring 2018-09-10  963  
f3ba91228e8e91 Rob Herring 2018-09-10  964  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
f3ba91228e8e91 Rob Herring 2018-09-10  965  		/* If there are any jobs in the HW queue, we're not idle */
f3ba91228e8e91 Rob Herring 2018-09-10 @966  		if (atomic_read(&js->queue[i].sched.hw_rq_count))

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
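Both build failures come from drivers reading the removed drm_gpu_scheduler::hw_rq_count directly. One plausible shape of the fix, sketched here as a hedged userspace model (the struct below is a hypothetical stand-in; it assumes the patch's new credit_count field takes over the "anything in flight?" role, which is not the actual follow-up commit):

```c
#include <stdbool.h>

/* Hypothetical stand-in for drm_gpu_scheduler after the patch: the
 * per-job hw_rq_count is gone; credit_count tracks in-flight credits. */
struct gpu_scheduler { int credit_count; };

/* Sketch of how the etnaviv/panfrost idle checks could be ported:
 * "any jobs in the HW queue" becomes "any credits in flight".
 * Illustration only, not the real driver code. */
static bool sched_is_busy(const struct gpu_scheduler *sched)
{
	return sched->credit_count != 0;
}
```

Either way, the patch needs to convert these remaining hw_rq_count users before the field can be removed.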


[PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

2023-10-30 Thread Danilo Krummrich
Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs which can be sent to the
hardware.

This implies that for each job, drivers need to account for the maximum
job size possible in order to not overflow the ring buffer.

However, there are drivers, such as Nouveau, where the job size has a
rather large range. For such drivers it can easily happen that job
submissions not even filling the ring by 1% can block subsequent
submissions, which, in the worst case, can lead to the ring run dry.

In order to overcome this issue, allow for tracking the actual job size
instead of the number of jobs. Therefore, add a field to track a job's
credit count, which represents the number of credits a job contributes
to the scheduler's credit limit.

Signed-off-by: Danilo Krummrich 
---
Changes in V2:
==
  - fixed up influence on scheduling fairness due to consideration of a job's
size
- If we reach a ready entity in drm_sched_select_entity() but can't actually
  queue a job from it due to size limitations, just give up and go to sleep
  until woken up due to a pending job finishing, rather than continue to try
  other entities.
  - added a callback to dynamically update a job's credits (Boris)
  - renamed 'units' to 'credits'
  - fixed commit message and comments as requested by Luben

Changes in V3:
==
  - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
  - move up drm_sched_can_queue() instead of adding a forward declaration
(Boris)
  - add a drm_sched_available_credits() helper (Boris)
  - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
  - re-phrase a few comments and fix a typo (Luben)
  - change naming of all structures credit fields and function parameters to the
following scheme
- drm_sched_job::credits
- drm_gpu_scheduler::credit_limit
- drm_gpu_scheduler::credit_count
- drm_sched_init(..., u32 credit_limit, ...)
- drm_sched_job_init(..., u32 credits, ...)
  - add proper documentation for the scheduler's job-flow control mechanism

Changes in V4:
==
  - address Luben's comments regarding documentation
  - switch to drm_WARN() variants
  - WARN_ON() drivers passing in zero job credits for both drm_sched_job_init()
and the update_job_credits() callback
  - don't retry with another runq if job doesn't fit on the ring to prevent
priority inversion
  - rebase onto drm-misc-next (will probably land before Matt's series)

Patch also available in [1].

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
---
 Documentation/gpu/drm-mm.rst  |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
 drivers/gpu/drm/lima/lima_sched.c |   2 +-
 drivers/gpu/drm/msm/msm_gem_submit.c  |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c   |   2 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c   |   2 +-
 .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
 drivers/gpu/drm/scheduler/sched_entity.c  |   4 +-
 drivers/gpu/drm/scheduler/sched_main.c| 148 ++
 drivers/gpu/drm/v3d/v3d_gem.c |   2 +-
 include/drm/gpu_scheduler.h   |  31 +++-
 12 files changed, 156 insertions(+), 49 deletions(-)

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index 602010cb6894..acc5901ac840 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,6 +552,12 @@ Overview
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
:doc: Overview
 
+Flow Control
+------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Flow Control
+
 Scheduler Function References
 -----------------------------
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 1f357198533f..62bb7fc7448a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct 
amdgpu_vm *vm,
if (!entity)
return 0;
 
-   return drm_sched_job_init(&(*job)->base, entity, owner);
+   return drm_sched_job_init(&(*job)->base, entity, 1, owner);
 }
 
 int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c 
b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
index 2416c526f9b0..3d0f8d182506 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
@@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void 
*data,
 
	ret = drm_sched_job_init(&submit->sched_job,
				 &ctx->sched_entity[args->pipe],
-				 submit->ctx);
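The credit accounting the patch introduces can be sketched in isolation as a simplified userspace model. Field names follow the patch's naming scheme (drm_gpu_scheduler::credit_limit/::credit_count, drm_sched_job::credits); the helper functions and the "just refuse" handling of oversized jobs are illustrative (the real code drm_WARN()s on drivers passing bad values).

```c
#include <stdbool.h>

/* Minimal stand-ins for the scheduler and job credit fields. */
struct sched { unsigned int credit_limit; unsigned int credit_count; };
struct job   { unsigned int credits; };

/* Credits still free on the ring: the limit minus what is in flight. */
static unsigned int available_credits(const struct sched *s)
{
	return s->credit_limit - s->credit_count;
}

/* A job may be queued iff its credit count fits the free credits.
 * Jobs larger than the whole ring could never run; the real code
 * warns on that case, here we simply refuse. */
static bool can_queue(const struct sched *s, const struct job *j)
{
	if (j->credits > s->credit_limit)
		return false;
	return j->credits <= available_credits(s);
}
```

Under this model a nearly empty ring still rejects a job that is larger than the remaining space, while the old job-count scheme would have admitted it and risked overflowing the ring, which is the problem the patch solves.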