Re: drm scheduler redesign causes deadlocks [extended repost]

2023-11-25 Thread Luben Tuikov
On 2023-11-24 04:38, Bert Karwatzki wrote:
> On Wednesday, 2023-11-22 at 18:02 -0500, Luben Tuikov wrote:
>> On 2023-11-21 04:00, Bert Karwatzki wrote:
>>> Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha 15
>>> laptop) suffers from random deadlocks which can occur after 30-180 min of
>>> usage. These deadlocks can be actively provoked by creating high system
>>> load (usually by compiling a kernel with make -j NRCPUS) and then opening
>>> instances of libreoffice --writer until the system GUI locks up (the mouse
>>> cursor can still be moved but the screen is frozen). In this state ssh'ing
>>> into the machine is still possible, and at least sometimes log messages
>>> about hung tasks appear in /var/log/kern.log.
>>>
>>> More info can be found here:
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
>>>
>>> Using the method described above to trigger the bug, I bisected the
>>> problem in the linux-next and drm-misc trees to commit f3123c2590005.
>>> As this simple patch fixes the problem,
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 044a8c4875ba..25b97db1b623 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>>>  		      struct drm_sched_entity *entity)
>>>  {
>>> -	if (drm_sched_entity_is_ready(entity))
>>> -		if (drm_sched_can_queue(sched, entity))
>>> -			drm_sched_run_job_queue(sched);
>>> +	if (drm_sched_can_queue(sched, entity))
>>> +		drm_sched_run_job_queue(sched);
>>>  }
>>>  
>>>  /**
>>>
>>> the cause might be in the entity->dependency branch of
>>> drm_sched_entity_is_ready() (some kind of circular dependency ...).
>>>
>>> To see whether the change to drm_sched_wakeup is the actual cause of the
>>> problem, or whether the problem was caused by the redesign of the drm
>>> scheduler in linux-next-20231115+, I created the following patch for
>>> linux-6.6.0:
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index a42763e1429d..dc2abd299aeb 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>>  		container_of(cb, struct drm_sched_entity, cb);
>>>
>>>  	drm_sched_entity_clear_dep(f, cb);
>>> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>  }
>>>
>>>  /**
>>> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>  			drm_sched_rq_update_fifo(entity, submit_ts);
>>>
>>> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>  	}
>>>  }
>>>  EXPORT_SYMBOL(drm_sched_entity_push_job);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 5a3a622fc672..bbe06403b33d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>   *
>>>   * Wake up the scheduler if we can queue jobs.
>>>   */
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
>>>  {
>>> -	if (drm_sched_can_queue(sched))
>>> -		wake_up_interruptible(&sched->wake_up_worker);
>>> +	if (drm_sched_entity_is_ready(entity))
>>> +		if (drm_sched_can_queue(sched))
>>> +			wake_up_interruptible(&sched->wake_up_worker);
>>>  }
>>>
>>>  /**
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index ac65f0626cfc..6cfe3d193e69 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>>  				   unsigned int num_sched_list);
>>>
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
>>>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>>>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>>
>>> This brings the extra check to the old scheduler and has so far not caused
>>> any trouble (using the same stress test described above), so chances are
>>> that the error is somewhere else in the redesigned scheduler.

Re: drm scheduler redesign causes deadlocks [extended repost]

2023-11-24 Thread Bert Karwatzki
On Wednesday, 2023-11-22 at 18:02 -0500, Luben Tuikov wrote:
> On 2023-11-21 04:00, Bert Karwatzki wrote:
> > Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha 15
> > laptop) suffers from random deadlocks which can occur after 30-180 min of
> > usage. These deadlocks can be actively provoked by creating high system
> > load (usually by compiling a kernel with make -j NRCPUS) and then opening
> > instances of libreoffice --writer until the system GUI locks up (the mouse
> > cursor can still be moved but the screen is frozen). In this state ssh'ing
> > into the machine is still possible, and at least sometimes log messages
> > about hung tasks appear in /var/log/kern.log.
> >
> > More info can be found here:
> > https://gitlab.freedesktop.org/drm/amd/-/issues/2994
> >
> > Using the method described above to trigger the bug, I bisected the
> > problem in the linux-next and drm-misc trees to commit f3123c2590005.
> > As this simple patch fixes the problem,
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 044a8c4875ba..25b97db1b623 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> >  		      struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_entity_is_ready(entity))
> > -		if (drm_sched_can_queue(sched, entity))
> > -			drm_sched_run_job_queue(sched);
> > +	if (drm_sched_can_queue(sched, entity))
> > +		drm_sched_run_job_queue(sched);
> >  }
> >  
> >  /**
> >
> > the cause might be in the entity->dependency branch of
> > drm_sched_entity_is_ready() (some kind of circular dependency ...).
> >
> > To see whether the change to drm_sched_wakeup is the actual cause of the
> > problem, or whether the problem was caused by the redesign of the drm
> > scheduler in linux-next-20231115+, I created the following patch for
> > linux-6.6.0:
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index a42763e1429d..dc2abd299aeb 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
> >  		container_of(cb, struct drm_sched_entity, cb);
> >
> >  	drm_sched_entity_clear_dep(f, cb);
> > -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
> >  }
> >
> >  /**
> > @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> >  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> >  			drm_sched_rq_update_fifo(entity, submit_ts);
> >
> > -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
> >  	}
> >  }
> >  EXPORT_SYMBOL(drm_sched_entity_push_job);
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 5a3a622fc672..bbe06403b33d 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> >   *
> >   * Wake up the scheduler if we can queue jobs.
> >   */
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_can_queue(sched))
> > -		wake_up_interruptible(&sched->wake_up_worker);
> > +	if (drm_sched_entity_is_ready(entity))
> > +		if (drm_sched_can_queue(sched))
> > +			wake_up_interruptible(&sched->wake_up_worker);
> >  }
> >
> >  /**
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index ac65f0626cfc..6cfe3d193e69 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				   unsigned int num_sched_list);
> >
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
> >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> >  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> >  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> >
> > This brings the extra check to the old scheduler and has so far not caused
> > any trouble (using the same stress test described above), so chances are
> > that the error is somewhere else in the redesigned scheduler.
> >
> >
> > 

Re: drm scheduler redesign causes deadlocks [extended repost]

2023-11-22 Thread Luben Tuikov
On 2023-11-21 04:00, Bert Karwatzki wrote:
> Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha 15
> laptop) suffers from random deadlocks which can occur after 30-180 min of
> usage. These deadlocks can be actively provoked by creating high system
> load (usually by compiling a kernel with make -j NRCPUS) and then opening
> instances of libreoffice --writer until the system GUI locks up (the mouse
> cursor can still be moved but the screen is frozen). In this state ssh'ing
> into the machine is still possible, and at least sometimes log messages
> about hung tasks appear in /var/log/kern.log.
> 
> More info can be found here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
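[Note: the bisection mentioned below would typically follow the standard git
bisect procedure; a sketch, assuming v6.6 as the last known good point:

    git bisect start
    git bisect bad next-20231115        # first tag known to hang
    git bisect good v6.6                # last known good kernel
    # at each step: build, boot, run the stress test above, then mark
    git bisect good                     # ... or: git bisect bad
    # when it converges, git prints "<sha> is the first bad commit"
    git bisect reset                    # return to the original branch
]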
> 
> Using the method described above to trigger the bug, I bisected the
> problem in the linux-next and drm-misc trees to commit f3123c2590005.
> As this simple patch fixes the problem,
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..25b97db1b623 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_entity_is_ready(entity))
> -		if (drm_sched_can_queue(sched, entity))
> -			drm_sched_run_job_queue(sched);
> +	if (drm_sched_can_queue(sched, entity))
> +		drm_sched_run_job_queue(sched);
>  }
>  
>  /**
> 
> the cause might be in the entity->dependency branch of
> drm_sched_entity_is_ready() (some kind of circular dependency ...).
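[Note: for reference, the two helpers discussed here look roughly as follows
in the kernels in question (a sketch from the v6.6-era sources, not a
verbatim quote); the "entity->dependency branch" is the second test:

    /* Ready = has queued jobs and no unresolved dependency fence. */
    bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
    {
            if (spsc_queue_peek(&entity->job_queue) == NULL)
                    return false;

            if (READ_ONCE(entity->dependency))
                    return false;

            return true;
    }

    /* v6.6 form: true while the hardware ring has room for another job. */
    static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
    {
            return atomic_read(&sched->hw_rq_count) < sched->hw_submission_limit;
    }
]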
> 
> To see whether the change to drm_sched_wakeup is the actual cause of the
> problem, or whether the problem was caused by the redesign of the drm
> scheduler in linux-next-20231115+, I created the following patch for
> linux-6.6.0:
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index a42763e1429d..dc2abd299aeb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  		container_of(cb, struct drm_sched_entity, cb);
> 
>  	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
> 
>  /**
> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  			drm_sched_rq_update_fifo(entity, submit_ts);
> 
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  	}
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a3a622fc672..bbe06403b33d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>   *
>   * Wake up the scheduler if we can queue jobs.
>   */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched))
> -		wake_up_interruptible(&sched->wake_up_worker);
> +	if (drm_sched_entity_is_ready(entity))
> +		if (drm_sched_can_queue(sched))
> +			wake_up_interruptible(&sched->wake_up_worker);
>  }
>  }
> 
>  /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index ac65f0626cfc..6cfe3d193e69 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				   unsigned int num_sched_list);
> 
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> 
> This brings the extra check to the old scheduler and has so far not caused
> any trouble (using the same stress test described above), so chances are
> that the error is somewhere else in the redesigned scheduler.
> 
> 
> Bert Karwatzki

Hi Bert,

Thanks for looking into this.

As an afterthought, removing the "entity_is_ready()" qualifier in the wake-up
makes the scheduling more opportunistic, and I agree that that is the more
correct approach. Commit f3123c2590005 basically made the code as close to
the way it 

Re: drm scheduler redesign causes deadlocks [extended repost]

2023-11-21 Thread Bert Karwatzki
On Tuesday, 2023-11-21 at 11:22 +0200, Jani Nikula wrote:
> On Tue, 21 Nov 2023, Bert Karwatzki  wrote:
> > As this simple patch fixes the problem
>
> Please use git send-email to send patches. Evolution botched up the
> whitespace here.
>
> BR,
> Jani.
>
>

Noted (this is actually not the first time Evolution has done this). But I'm
not resending the patches here, because they are not meant as actual fixes;
they only serve to illustrate the problem, whose cause is not yet known.

Bert Karwatzki


Re: drm scheduler redesign causes deadlocks [extended repost]

2023-11-21 Thread Jani Nikula
On Tue, 21 Nov 2023, Bert Karwatzki  wrote:
> As this simple patch fixes the problem

Please use git send-email to send patches. Evolution botched up the
whitespace here.
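
[A typical invocation, assuming the fix is the latest commit on the current
branch and is aimed at the dri-devel list:

    git format-patch -1                 # writes 0001-<subject>.patch
    git send-email --to=dri-devel@lists.freedesktop.org 0001-*.patch

git send-email hands the patch text to SMTP unmodified, so the whitespace
survives intact.]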

BR,
Jani.


-- 
Jani Nikula, Intel