+ dri-devel

On Tue, May 9, 2023 at 5:43 PM <vitaly.pros...@amd.com> wrote:
>
> From: Vitaly Prosyak <vitaly.pros...@amd.com>
>
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9bf8bf189ffc116e678f7a2dc16
> drm/sched: Check scheduler ready before calling timeout handling.
>
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
>
> 0  gfx_v9_0_cp_gfx_start
> 1  gfx_v9_0_cp_gfx_resume
> 2  gfx_v9_0_cp_resume
> 3  gfx_v9_0_hw_init
> 4  gfx_v9_0_resume
> 5  amdgpu_device_ip_resume_phase2
>
> does the following:
>         /* start the ring */
>         gfx_v9_0_cp_gfx_start(adev);
>         ring->sched.ready = true;
>
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
>
> As a result, our GPU reset test causes GPU fault which calls unconditionally 
> gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt 
> service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which 
> sets the ready
> field of the scheduler to true even  for not initialized schedulers and 
> causes oops vs
> no fault or when ISR  drm_sched_fault is completed prior  
> gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
>
> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
>
> Signed-off-by: Vitaly Prosyak <vitaly.pros...@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct 
> drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>  {
> -       if (sched->ready)
> +       if (sched->timeout_wq)
>                 mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>  }
>  EXPORT_SYMBOL(drm_sched_fault);
> --
> 2.25.1
>

Reply via email to