Re: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset
On Wed, Mar 21, 2018 at 04:42:32PM +, Chris Wilson wrote: > Quoting Jeff McGee (2018-03-21 15:55:16) > > On Wed, Mar 21, 2018 at 03:00:23PM +, Chris Wilson wrote: > > > After resetting the GPU (or subset of engines), call synchronize_irq() > > > to flush any pending irq before proceeding with the cleanup. For a > > > device level reset, we disable the interupts around the reset, but when > > > resetting just one engine, we have to avoid such global disabling. This > > > leaves us open to an interrupt arriving for the engine as we try to > > > reset it. We already do try to flush the IIR following the reset, but we > > > have to ensure that the in-flight interrupt does not land after we start > > > cleaning up after the reset; enter synchronize_irq(). > > > > > > As it current stands, we very rarely, but fatally, see sequences such as: > > > > > > 2 57964564us : execlists_reset_prepare: rcs0 > > > 2 57964613us : execlists_reset: rcs0 seqno=424 > > > 0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1 > > > 2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- > > > global_seqno 1060 > > > 2 57964703us : execlists_reset_finish: rcs0 > > > 0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, > > > active=0, irq-posted?=1 > > > > > I can repro this sequence easily with force preemption IGT. > > With the sequence I suggested? > -Chris Yes. Your approach to protecting port[1] context is working well. This is the only issue I'm still hitting. I'll post my updated RFC set in a sec. -Jeff ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset
Quoting Jeff McGee (2018-03-21 15:55:16) > On Wed, Mar 21, 2018 at 03:00:23PM +, Chris Wilson wrote: > > After resetting the GPU (or subset of engines), call synchronize_irq() > > to flush any pending irq before proceeding with the cleanup. For a > > device level reset, we disable the interupts around the reset, but when > > resetting just one engine, we have to avoid such global disabling. This > > leaves us open to an interrupt arriving for the engine as we try to > > reset it. We already do try to flush the IIR following the reset, but we > > have to ensure that the in-flight interrupt does not land after we start > > cleaning up after the reset; enter synchronize_irq(). > > > > As it current stands, we very rarely, but fatally, see sequences such as: > > > > 2 57964564us : execlists_reset_prepare: rcs0 > > 2 57964613us : execlists_reset: rcs0 seqno=424 > > 0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1 > > 2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- > > global_seqno 1060 > > 2 57964703us : execlists_reset_finish: rcs0 > > 0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, > > active=0, irq-posted?=1 > > > I can repro this sequence easily with force preemption IGT. With the sequence I suggested? -Chris ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset
Quoting Jeff McGee (2018-03-21 15:55:16) > On Wed, Mar 21, 2018 at 03:00:23PM +, Chris Wilson wrote: > > After resetting the GPU (or subset of engines), call synchronize_irq() > > to flush any pending irq before proceeding with the cleanup. For a > > device level reset, we disable the interupts around the reset, but when > > resetting just one engine, we have to avoid such global disabling. This > > leaves us open to an interrupt arriving for the engine as we try to > > reset it. We already do try to flush the IIR following the reset, but we > > have to ensure that the in-flight interrupt does not land after we start > > cleaning up after the reset; enter synchronize_irq(). > > > > As it current stands, we very rarely, but fatally, see sequences such as: > > > > 2 57964564us : execlists_reset_prepare: rcs0 > > 2 57964613us : execlists_reset: rcs0 seqno=424 > > 0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1 > > 2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- > > global_seqno 1060 > > 2 57964703us : execlists_reset_finish: rcs0 > > 0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, > > active=0, irq-posted?=1 > > > I can repro this sequence easily with force preemption IGT. Just tried this > patch and the issue is still there. For the moment I can just mitigate with > https://patchwork.freedesktop.org/patch/211086/ These patch is purely to make sure that the irq_handler is not called after execlists_reset, as that is what we rely on. We *should* never be in a situation where CSB tail is invalid, that implies either we screwed up or the hw is incoherent. -Chris ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset
On Wed, Mar 21, 2018 at 03:00:23PM +, Chris Wilson wrote: > After resetting the GPU (or subset of engines), call synchronize_irq() > to flush any pending irq before proceeding with the cleanup. For a > device level reset, we disable the interupts around the reset, but when > resetting just one engine, we have to avoid such global disabling. This > leaves us open to an interrupt arriving for the engine as we try to > reset it. We already do try to flush the IIR following the reset, but we > have to ensure that the in-flight interrupt does not land after we start > cleaning up after the reset; enter synchronize_irq(). > > As it current stands, we very rarely, but fatally, see sequences such as: > > 2 57964564us : execlists_reset_prepare: rcs0 > 2 57964613us : execlists_reset: rcs0 seqno=424 > 0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1 > 2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- > global_seqno 1060 > 2 57964703us : execlists_reset_finish: rcs0 > 0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, active=0, > irq-posted?=1 > I can repro this sequence easily with force preemption IGT. Just tried this patch and the issue is still there. For the moment I can just mitigate with https://patchwork.freedesktop.org/patch/211086/ > Signed-off-by: Chris Wilson> Cc: Mika Kuoppala > Cc: Michel Thierry > Cc: MichaĆ Winiarski > --- > drivers/gpu/drm/i915/intel_uncore.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/intel_uncore.c > b/drivers/gpu/drm/i915/intel_uncore.c > index 4c616d074a97..04830d6125d6 100644 > --- a/drivers/gpu/drm/i915/intel_uncore.c > +++ b/drivers/gpu/drm/i915/intel_uncore.c > @@ -2116,11 +2116,14 @@ int intel_gpu_reset(struct drm_i915_private > *dev_priv, unsigned engine_mask) > i915_stop_engines(dev_priv, engine_mask); > > ret = -ENODEV; > - if (reset) > + if (reset) { > + GEM_TRACE("engine_mask=%x\n", engine_mask); > ret = reset(dev_priv, engine_mask); > + } > if (ret != -ETIMEDOUT) > break; > > + synchronize_irq(dev_priv->drm.irq); > cond_resched(); > } > intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); > -- > 2.16.2 > > ___ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/intel-gfx ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx