Re: drm scheduler redesign causes deadlocks [extended repost]
On 2023-11-21 04:00, Bert Karwatzki wrote:
> Since linux-next-20231115 my linux system (debian sid on an msi alpha 15
> laptop) suffers from random deadlocks which can occur after 30 - 180 min of
> usage. These deadlocks can be actively provoked by creating high system load
> (usually by compiling a kernel with make -j NRCPUS) and then opening
> instances of libreoffice --writer until the system GUI locks up (the mouse
> cursor can still be moved but the screen is frozen). In this state ssh'ing
> into the machine is still possible, and at least sometimes log messages
> about hung tasks appear in /var/log/kern.log.
>
> More info can be found here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
>
> Using the method described above to trigger the bug, I bisected the problem
> in the linux-next and drm-misc trees, which identified commit f3123c2590005
> as the culprit. As this simple patch fixes the problem
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..25b97db1b623 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_entity_is_ready(entity))
> -		if (drm_sched_can_queue(sched, entity))
> -			drm_sched_run_job_queue(sched);
> +	if (drm_sched_can_queue(sched, entity))
> +		drm_sched_run_job_queue(sched);
>  }
>
>  /**
>
> the problem might lie in the entity->dependency branch of
> drm_sched_entity_is_ready (some kind of circular dependencies ...).
>
> To see whether the change to drm_sched_wakeup is the actual cause of the
> problem, or whether the problem has been caused by the redesign of the drm
> scheduler in linux-next-20231115+, I created the following patch for
> linux-6.6.0:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index a42763e1429d..dc2abd299aeb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  		container_of(cb, struct drm_sched_entity, cb);
>
>  	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
>
>  /**
> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  			drm_sched_rq_update_fifo(entity, submit_ts);
>
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  	}
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a3a622fc672..bbe06403b33d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>   *
>   * Wake up the scheduler if we can queue jobs.
>   */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched))
> -		wake_up_interruptible(&sched->wake_up_worker);
> +	if (drm_sched_entity_is_ready(entity))
> +		if (drm_sched_can_queue(sched))
> +			wake_up_interruptible(&sched->wake_up_worker);
>  }
>
>  /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index ac65f0626cfc..6cfe3d193e69 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				   unsigned int num_sched_list);
>
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>
> This brings the extra check to the old scheduler and has so far not caused
> any trouble (using the same stress test described above), so chances are
> that the error is somewhere else in the redesigned scheduler.

Hi Bert,

Thanks for looking into this.

As an afterthought, removing the "entity_is_ready()" qualifier in the wake-up
makes the scheduling more opportunistic, and I agree that that is the more
correct approach. Commit f3123c2590005 basically made the code as close to the
way it
Re: drm scheduler redesign causes deadlocks [extended repost]
On Tuesday, 2023-11-21 at 11:22 +0200, Jani Nikula wrote:
> On Tue, 21 Nov 2023, Bert Karwatzki wrote:
> > As this simple patch fixes the problem
>
> Please use git send-email to send patches. Evolution botched up the
> whitespace here.
>
> BR,
> Jani.

Noted (this is actually not the first time evolution has done this). But I'm
not resending the patches here, because they are not meant as actual fixes;
they only serve to illustrate the problem, the cause of which is not yet
known.

Bert Karwatzki
Re: drm scheduler redesign causes deadlocks [extended repost]
On Tue, 21 Nov 2023, Bert Karwatzki wrote:
> As this simple patch fixes the problem

Please use git send-email to send patches. Evolution botched up the
whitespace here.

BR,
Jani.

--
Jani Nikula, Intel
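As a side note for anyone setting this up: sending patches straight from the
repository with git send-email avoids the mailer re-wrapping tabs and long
lines. A minimal sketch of the configuration — the server, port, and address
below are placeholders, not anything from this thread:

```ini
# ~/.gitconfig — example [sendemail] section; all values are placeholders
[sendemail]
	smtpServer = smtp.example.org
	smtpUser = user@example.org
	smtpEncryption = tls
	smtpServerPort = 587
```

A single patch can then go out with, for example,
`git send-email --to=dri-devel@lists.freedesktop.org -1`, which preserves the
whitespace exactly as git format-patch writes it.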