Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
On 2023-10-31 22:23, Danilo Krummrich wrote:
> Hi Luben,
>
> [snip]
>
>>> @@ -187,12 +251,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>  /**
>>>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Try to find a ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>> +			      struct drm_sched_rq *rq)
>>>  {
>>>  	struct drm_sched_entity *entity;
>>>
>>> @@ -202,6 +268,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  	if (entity) {
>>>  		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>  			if (drm_sched_entity_is_ready(entity)) {
>>> +				/* If we can't queue yet, preserve the current
>>> +				 * entity in terms of fairness.
>>> +				 */
>>> +				if (!drm_sched_can_queue(sched, entity)) {
>>> +					spin_unlock(&rq->lock);
>>> +					return ERR_PTR(-ENOSPC);
>>> +				}
>>> +
>>>  				rq->current_entity = entity;
>>>  				reinit_completion(&entity->entity_idle);
>>>  				spin_unlock(&rq->lock);
>>> @@ -211,8 +285,15 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  	}
>>>
>>>  	list_for_each_entry(entity, &rq->entities, list) {
>>> -
>>>  		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				spin_unlock(&rq->lock);
>>> +				return ERR_PTR(-ENOSPC);
>>> +			}
>>> +
>>>  			rq->current_entity = entity;
>>>  			reinit_completion(&entity->entity_idle);
>>>  			spin_unlock(&rq->lock);
>>> @@ -231,12 +312,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  /**
>>>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Find oldest waiting ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_rq *rq)
>>>  {
>>>  	struct rb_node *rb;
>>>
>>> @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>
>>>  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>  		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				spin_unlock(&rq->lock);
>>> +				return ERR_PTR(-ENOSPC);
>>> +			}
>>> +
>>
>> So, this kinda did this abrupt "return" in v2, then it was fixed fine in v3,
>> and now I see we return and return an error now, which doesn't seem to be used
>> by anyone. In fact, in drm_sched_select_entity(), we do this,
>>
>> -	return entity;
>> +	return IS_ERR(entity) ? NULL : entity;
>>
>> So, it's perhaps best to leave this as it was in v3, and if in the future
>> we need to distinguish between the type of error, then that future patch
>> would do that and also show how this is used with new code and logic.
>>
>> There is little value to over-engineer this right now, given that in
>> the future, the logic may be more esoteric than we can think of.
>
> Ok, maybe what I do here is a little bit subtle and requires a comment. Let me
> explain.
>
> The reason I return an ERR_PTR() instead of NULL is to indicate to
> drm_sched_select_entity() to break out of the runqueue loop
> (drm_sched_select_entity() breaks the loop when the returned entity is not
> NULL).
>
> Without that, we'd continue the runqueue loop in drm_sched_select_entity() and
> retry with the next lower priority. This could lead to priority inversion of
> the runqueues, since a lower priority runqueue with jobs with less credits
> could stall a higher priority runqueue with jobs with more credits.
>
> Hence, in drm_sched_select_entity() we need to be able to distinguish between
> drm_sched_rq_select_entity_fifo()/drm_sched_rq_select_entity_rr() can't find an
> entity and they *can* find an entity,
Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
Hi Luben,

On Tue, Oct 31, 2023 at 08:51:17PM -0400, Luben Tuikov wrote:
> Hi,
>
> (PSA: luben.tui...@amd.com should've bounced :-) I'm removing it from the To:
> field.)
>
> On 2023-10-30 20:26, Danilo Krummrich wrote:
> > [snip]
Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
Hi,

(PSA: luben.tui...@amd.com should've bounced :-) I'm removing it from the To:
field.)

On 2023-10-30 20:26, Danilo Krummrich wrote:
> [snip]
Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
On 10/31/23 12:13, Christian König wrote:
> Am 31.10.23 um 01:26 schrieb Danilo Krummrich:
> > [snip]
Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
Am 31.10.23 um 01:26 schrieb Danilo Krummrich:
> [snip]
Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
Hi Danilo,

kernel test robot noticed the following build errors:

[auto build test ERROR on b2139fb5051554a7f297e4ad584ef1bc26c76d5d]

url:    https://github.com/intel-lab-lkp/linux/commits/Danilo-Krummrich/drm-sched-implement-dynamic-job-flow-control/20231031-082915
base:   b2139fb5051554a7f297e4ad584ef1bc26c76d5d
patch link:    https://lore.kernel.org/r/20231031002655.38707-1-dakr%40redhat.com
patch subject: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
config: loongarch-randconfig-002-20231031 (https://download.01.org/0day-ci/archive/20231031/202310311632.rpehvhmk-...@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231031/202310311632.rpehvhmk-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot
| Closes: https://lore.kernel.org/oe-kbuild-all/202310311632.rpehvhmk-...@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/etnaviv/etnaviv_gpu.c: In function 'etnaviv_gpu_rpm_suspend':
>> drivers/gpu/drm/etnaviv/etnaviv_gpu.c:1920:36: error: 'struct drm_gpu_scheduler' has no member named 'hw_rq_count'
    1920 |         if (atomic_read(&gpu->sched.hw_rq_count))
         |                                    ^

vim +1920 drivers/gpu/drm/etnaviv/etnaviv_gpu.c

a8c21a5451d831 The etnaviv authors 2015-12-03  1913
a8c21a5451d831 The etnaviv authors 2015-12-03  1914  static int etnaviv_gpu_rpm_suspend(struct device *dev)
a8c21a5451d831 The etnaviv authors 2015-12-03  1915  {
a8c21a5451d831 The etnaviv authors 2015-12-03  1916  	struct etnaviv_gpu *gpu = dev_get_drvdata(dev);
a8c21a5451d831 The etnaviv authors 2015-12-03  1917  	u32 idle, mask;
a8c21a5451d831 The etnaviv authors 2015-12-03  1918
f4163814813fb3 Lucas Stach         2018-11-05  1919  	/* If there are any jobs in the HW queue, we're not idle */
f4163814813fb3 Lucas Stach         2018-11-05 @1920  	if (atomic_read(&gpu->sched.hw_rq_count))
a8c21a5451d831 The etnaviv authors 2015-12-03  1921  		return -EBUSY;
a8c21a5451d831 The etnaviv authors 2015-12-03  1922
1a910c11d35bfa Guido Günther       2020-03-02  1923  	/* Check whether the hardware (except FE and MC) is idle */
1a910c11d35bfa Guido Günther       2020-03-02  1924  	mask = gpu->idle_mask & ~(VIVS_HI_IDLE_STATE_FE |
1a910c11d35bfa Guido Günther       2020-03-02  1925  				  VIVS_HI_IDLE_STATE_MC);
a8c21a5451d831 The etnaviv authors 2015-12-03  1926  	idle = gpu_read(gpu, VIVS_HI_IDLE_STATE) & mask;
78f2bfa3181cd7 Guido Günther       2020-03-02  1927  	if (idle != mask) {
78f2bfa3181cd7 Guido Günther       2020-03-02  1928  		dev_warn_ratelimited(dev, "GPU not yet idle, mask: 0x%08x\n",
78f2bfa3181cd7 Guido Günther       2020-03-02  1929  				     idle);
a8c21a5451d831 The etnaviv authors 2015-12-03  1930  		return -EBUSY;
78f2bfa3181cd7 Guido Günther       2020-03-02  1931  	}
a8c21a5451d831 The etnaviv authors 2015-12-03  1932
7cb544946a138d Lucas Stach         2023-06-07  1933  	etnaviv_gpu_hw_suspend(gpu);
7cb544946a138d Lucas Stach         2023-06-07  1934
647d817d807127 Lucas Stach         2023-06-07  1935  	gpu->state = ETNA_GPU_STATE_IDENTIFIED;
647d817d807127 Lucas Stach         2023-06-07  1936
7cb544946a138d Lucas Stach         2023-06-07  1937  	return etnaviv_gpu_clk_disable(gpu);
a8c21a5451d831 The etnaviv authors 2015-12-03  1938  }
a8c21a5451d831 The etnaviv authors 2015-12-03  1939

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
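The error follows from the patch replacing drm_gpu_scheduler::hw_rq_count with credit_count without updating drivers that poll it for idleness. Assuming credit_count keeps the same "non-zero means work in flight" meaning, the etnaviv fixup would presumably be a mechanical rename (a sketch, not a tested hunk):

```
--- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gpu.c
@@ static int etnaviv_gpu_rpm_suspend(struct device *dev)
 	/* If there are any jobs in the HW queue, we're not idle */
-	if (atomic_read(&gpu->sched.hw_rq_count))
+	if (atomic_read(&gpu->sched.credit_count))
 		return -EBUSY;
```

The zero/non-zero idleness test is preserved even though the counter now counts credits rather than jobs.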
Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
Hi Danilo,

kernel test robot noticed the following build errors:

[auto build test ERROR on b2139fb5051554a7f297e4ad584ef1bc26c76d5d]

url:    https://github.com/intel-lab-lkp/linux/commits/Danilo-Krummrich/drm-sched-implement-dynamic-job-flow-control/20231031-082915
base:   b2139fb5051554a7f297e4ad584ef1bc26c76d5d
patch link:    https://lore.kernel.org/r/20231031002655.38707-1-dakr%40redhat.com
patch subject: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
config: arc-randconfig-002-20231031 (https://download.01.org/0day-ci/archive/20231031/202310311534.i8wqwok0-...@intel.com/config)
compiler: arceb-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231031/202310311534.i8wqwok0-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot
| Closes: https://lore.kernel.org/oe-kbuild-all/202310311534.i8wqwok0-...@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/panfrost/panfrost_job.c: In function 'panfrost_job_is_idle':
>> drivers/gpu/drm/panfrost/panfrost_job.c:966:52: error: 'struct drm_gpu_scheduler' has no member named 'hw_rq_count'
     966 |                 if (atomic_read(&js->queue[i].sched.hw_rq_count))
         |                                                    ^

vim +966 drivers/gpu/drm/panfrost/panfrost_job.c

f3ba91228e8e91 Rob Herring 2018-09-10  958
f3ba91228e8e91 Rob Herring 2018-09-10  959  int panfrost_job_is_idle(struct panfrost_device *pfdev)
f3ba91228e8e91 Rob Herring 2018-09-10  960  {
f3ba91228e8e91 Rob Herring 2018-09-10  961  	struct panfrost_job_slot *js = pfdev->js;
f3ba91228e8e91 Rob Herring 2018-09-10  962  	int i;
f3ba91228e8e91 Rob Herring 2018-09-10  963
f3ba91228e8e91 Rob Herring 2018-09-10  964  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
f3ba91228e8e91 Rob Herring 2018-09-10  965  		/* If there are any jobs in the HW queue, we're not idle */
f3ba91228e8e91 Rob Herring 2018-09-10 @966  		if (atomic_read(&js->queue[i].sched.hw_rq_count))

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
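This is the same hw_rq_count → credit_count fallout as in the etnaviv report. Assuming the new counter keeps the "non-zero means work in flight" meaning, the panfrost fixup would presumably be the analogous rename (a sketch, not a tested hunk):

```
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ int panfrost_job_is_idle(struct panfrost_device *pfdev)
 		/* If there are any jobs in the HW queue, we're not idle */
-		if (atomic_read(&js->queue[i].sched.hw_rq_count))
+		if (atomic_read(&js->queue[i].sched.credit_count))
 			return false;
```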
[PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs which can be sent to the
hardware.

This implies that for each job, drivers need to account for the maximum
job size possible in order to not overflow the ring buffer.

However, there are drivers, such as Nouveau, where the job size has a
rather large range. For such drivers it can easily happen that job
submissions not even filling the ring by 1% can block subsequent
submissions, which, in the worst case, can lead to the ring run dry.

In order to overcome this issue, allow for tracking the actual job size
instead of the number of jobs. Therefore, add a field to track a job's
credit count, which represents the number of credits a job contributes
to the scheduler's credit limit.

Signed-off-by: Danilo Krummrich
---
Changes in V2:
==============
  - fixed up influence on scheduling fairness due to consideration of a job's
    size
  - If we reach a ready entity in drm_sched_select_entity() but can't actually
    queue a job from it due to size limitations, just give up and go to sleep
    until woken up due to a pending job finishing, rather than continue to try
    other entities.
  - added a callback to dynamically update a job's credits (Boris)
  - renamed 'units' to 'credits'
  - fixed commit message and comments as requested by Luben

Changes in V3:
==============
  - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
  - move up drm_sched_can_queue() instead of adding a forward declaration
    (Boris)
  - add a drm_sched_available_credits() helper (Boris)
  - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's
    proposal
  - re-phrase a few comments and fix a typo (Luben)
  - change naming of all structures credit fields and function parameters to
    the following scheme
      - drm_sched_job::credits
      - drm_gpu_scheduler::credit_limit
      - drm_gpu_scheduler::credit_count
      - drm_sched_init(..., u32 credit_limit, ...)
      - drm_sched_job_init(..., u32 credits, ...)
  - add proper documentation for the scheduler's job-flow control mechanism

Changes in V4:
==============
  - address Luben's comments regarding documentation
  - switch to drm_WARN() variants
  - WARN_ON() drivers passing in zero job credits for both drm_sched_job_init()
    and the update_job_credits() callback
  - don't retry with another runq if job doesn't fit on the ring to prevent
    priority inversion
  - rebase onto drm-misc-next (will probably land before Matt's series)

Patch also available in [1].

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
---
 Documentation/gpu/drm-mm.rst                  |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
 drivers/gpu/drm/lima/lima_sched.c             |   2 +-
 drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
 .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
 drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
 drivers/gpu/drm/scheduler/sched_main.c        | 148 ++++++++++++++----
 drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
 include/drm/gpu_scheduler.h                   |  31 +++-
 12 files changed, 156 insertions(+), 49 deletions(-)

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index 602010cb6894..acc5901ac840 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,6 +552,12 @@ Overview
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
    :doc: Overview
 
+Flow Control
+------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Flow Control
+
 Scheduler Function References
 -----------------------------
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 1f357198533f..62bb7fc7448a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	if (!entity)
 		return 0;
 
-	return drm_sched_job_init(&(*job)->base, entity, owner);
+	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
 }
 
 int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
index 2416c526f9b0..3d0f8d182506 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
@@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
 
 	ret = drm_sched_job_init(&submit->sched_job,
 				 &ctx->sched_entity[args->pipe],
-				 submit->ctx);