Re: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset

2018-03-21 Thread Jeff McGee
On Wed, Mar 21, 2018 at 04:42:32PM +, Chris Wilson wrote:
> Quoting Jeff McGee (2018-03-21 15:55:16)
> > On Wed, Mar 21, 2018 at 03:00:23PM +, Chris Wilson wrote:
> > > After resetting the GPU (or subset of engines), call synchronize_irq()
> > > to flush any pending irq before proceeding with the cleanup. For a
> > > device level reset, we disable the interupts around the reset, but when
> > > resetting just one engine, we have to avoid such global disabling. This
> > > leaves us open to an interrupt arriving for the engine as we try to
> > > reset it. We already do try to flush the IIR following the reset, but we
> > > have to ensure that the in-flight interrupt does not land after we start
> > > cleaning up after the reset; enter synchronize_irq().
> > > 
> > > As it current stands, we very rarely, but fatally, see sequences such as:
> > > 
> > > 2 57964564us : execlists_reset_prepare: rcs0
> > > 2 57964613us : execlists_reset: rcs0 seqno=424
> > > 0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
> > > 2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- 
> > > global_seqno 1060
> > > 2 57964703us : execlists_reset_finish: rcs0
> > > 0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, 
> > > active=0, irq-posted?=1
> > > 
> > I can repro this sequence easily with force preemption IGT.
> 
> With the sequence I suggested?
> -Chris

Yes. Your approach to protecting port[1] context is working well. This is
the only issue I'm still hitting. I'll post my updated RFC set in a sec.
-Jeff
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset

2018-03-21 Thread Chris Wilson
Quoting Jeff McGee (2018-03-21 15:55:16)
> On Wed, Mar 21, 2018 at 03:00:23PM +, Chris Wilson wrote:
> > After resetting the GPU (or subset of engines), call synchronize_irq()
> > to flush any pending irq before proceeding with the cleanup. For a
> > device level reset, we disable the interupts around the reset, but when
> > resetting just one engine, we have to avoid such global disabling. This
> > leaves us open to an interrupt arriving for the engine as we try to
> > reset it. We already do try to flush the IIR following the reset, but we
> > have to ensure that the in-flight interrupt does not land after we start
> > cleaning up after the reset; enter synchronize_irq().
> > 
> > As it current stands, we very rarely, but fatally, see sequences such as:
> > 
> > 2 57964564us : execlists_reset_prepare: rcs0
> > 2 57964613us : execlists_reset: rcs0 seqno=424
> > 0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
> > 2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- 
> > global_seqno 1060
> > 2 57964703us : execlists_reset_finish: rcs0
> > 0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, 
> > active=0, irq-posted?=1
> > 
> I can repro this sequence easily with force preemption IGT.

With the sequence I suggested?
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset

2018-03-21 Thread Chris Wilson
Quoting Jeff McGee (2018-03-21 15:55:16)
> On Wed, Mar 21, 2018 at 03:00:23PM +, Chris Wilson wrote:
> > After resetting the GPU (or subset of engines), call synchronize_irq()
> > to flush any pending irq before proceeding with the cleanup. For a
> > device level reset, we disable the interupts around the reset, but when
> > resetting just one engine, we have to avoid such global disabling. This
> > leaves us open to an interrupt arriving for the engine as we try to
> > reset it. We already do try to flush the IIR following the reset, but we
> > have to ensure that the in-flight interrupt does not land after we start
> > cleaning up after the reset; enter synchronize_irq().
> > 
> > As it current stands, we very rarely, but fatally, see sequences such as:
> > 
> > 2 57964564us : execlists_reset_prepare: rcs0
> > 2 57964613us : execlists_reset: rcs0 seqno=424
> > 0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
> > 2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- 
> > global_seqno 1060
> > 2 57964703us : execlists_reset_finish: rcs0
> > 0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, 
> > active=0, irq-posted?=1
> > 
> I can repro this sequence easily with force preemption IGT. Just tried this
> patch and the issue is still there. For the moment I can just mitigate with
> https://patchwork.freedesktop.org/patch/211086/

These patch is purely to make sure that the irq_handler is not called
after execlists_reset, as that is what we rely on.

We *should* never be in a situation where CSB tail is invalid, that
implies either we screwed up or the hw is incoherent.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset

2018-03-21 Thread Jeff McGee
On Wed, Mar 21, 2018 at 03:00:23PM +, Chris Wilson wrote:
> After resetting the GPU (or subset of engines), call synchronize_irq()
> to flush any pending irq before proceeding with the cleanup. For a
> device level reset, we disable the interupts around the reset, but when
> resetting just one engine, we have to avoid such global disabling. This
> leaves us open to an interrupt arriving for the engine as we try to
> reset it. We already do try to flush the IIR following the reset, but we
> have to ensure that the in-flight interrupt does not land after we start
> cleaning up after the reset; enter synchronize_irq().
> 
> As it current stands, we very rarely, but fatally, see sequences such as:
> 
> 2 57964564us : execlists_reset_prepare: rcs0
> 2 57964613us : execlists_reset: rcs0 seqno=424
> 0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
> 2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- 
> global_seqno 1060
> 2 57964703us : execlists_reset_finish: rcs0
> 0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, active=0, 
> irq-posted?=1
> 
I can repro this sequence easily with force preemption IGT. Just tried this
patch and the issue is still there. For the moment I can just mitigate with
https://patchwork.freedesktop.org/patch/211086/

> Signed-off-by: Chris Wilson 
> Cc: Mika Kuoppala 
> Cc: Michel Thierry 
> Cc: MichaƂ Winiarski 
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c 
> b/drivers/gpu/drm/i915/intel_uncore.c
> index 4c616d074a97..04830d6125d6 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -2116,11 +2116,14 @@ int intel_gpu_reset(struct drm_i915_private 
> *dev_priv, unsigned engine_mask)
>   i915_stop_engines(dev_priv, engine_mask);
>  
>   ret = -ENODEV;
> - if (reset)
> + if (reset) {
> + GEM_TRACE("engine_mask=%x\n", engine_mask);
>   ret = reset(dev_priv, engine_mask);
> + }
>   if (ret != -ETIMEDOUT)
>   break;
>  
> + synchronize_irq(dev_priv->drm.irq);
>   cond_resched();
>   }
>   intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> -- 
> 2.16.2
> 
> ___
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx