g4x: Improve gpu reset reliability

Mika Kuoppala Fri, 19 May 2017 04:23:44 -0700

Chris Wilson <[email protected]> writes:

> On Thu, May 18, 2017 at 05:28:41PM +0300, Mika Kuoppala wrote:
>> ELK seems to very picky about the preconditions to reset.
>> Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
>> not like if reset occurs when there is active ring.
>> 
>> Ville found out that there is workaround with name
>> 'WaMediaResetMainRingCleanup' which suggests that we need to
>> cleanup rings before resetting. It is unclear what cleanup
>> exactly means but evidence shows that stopping the ring
>> does have an effect on reset reliability. This patch makes
>> reset succesful on hangs induced by chained batches (the igt ones).
>> Note that if the hang is inside a shader, it is possible
>> that our attempts to stop the ring achieves anything.
>> 
>> v2: zero ctl,head,tail also. bug ref. use driver debugs (Chris)
>> 
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100942
>> Testcase: igt/gem_busy/*-hang
>> Testcase: igt/gem_ringfill/hang-*
>
> Maybe add # elk to these to indicate the problem isn't quite that
> widespread!
>
>> Suggested-by: Ville Syrjälä <[email protected]>
>> Cc: Ville Syrjälä <[email protected]>
>> Cc: Chris Wilson <[email protected]>
>> Cc: Tomi Sarvela <[email protected]>
>> Signed-off-by: Mika Kuoppala <[email protected]>
>> ---
>>  drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++++++++
>>  1 file changed, 35 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c 
>> b/drivers/gpu/drm/i915/intel_uncore.c
>> index 7eaaf2225e1a..43da84be0321 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1427,6 +1427,35 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>>      return ret;
>>  }
>>  
>> +static void gen3_stop_rings(struct drm_i915_private *dev_priv)
>> +{
>> +    struct intel_engine_cs *engine;
>> +    enum intel_engine_id id;
>> +
>> +    for_each_engine(engine, dev_priv, id) {
>> +            const u32 base = engine->mmio_base;
>> +            const i915_reg_t mode = RING_MI_MODE(base);
>> +
>> +            I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
>> +            if (intel_wait_for_register_fw(dev_priv,
>> +                                           mode,
>> +                                           MODE_IDLE,
>> +                                           MODE_IDLE,
>> +                                           500))
>> +                    DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
>> +                                     engine->name);
>> +
>> +            I915_WRITE_FW(RING_CTL(base), 0);
>> +            I915_WRITE_FW(RING_HEAD(base), 0);
>> +            I915_WRITE_FW(RING_TAIL(base), 0);
>> +
>> +            /* Check acts as a post */
>> +            if (I915_READ_FW(RING_HEAD(base)) != 0)
>> +                    DRM_DEBUG_DRIVER("%s: ring head not parked\n",
>> +                                     engine->name);
>> +    }
>> +}
>> +
>>  static bool i915_reset_complete(struct pci_dev *pdev)
>>  {
>>      u8 gdrst;
>> @@ -1472,6 +1501,12 @@ static int g4x_do_reset(struct drm_i915_private 
>> *dev_priv, unsigned engine_mask)
>>      I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) | 
>> VCP_UNIT_CLOCK_GATE_DISABLE);
>>      POSTING_READ(VDECCLK_GATE_D);
>>  
>> +    /* We stop engines, otherwise we might get failed reset and a
>> +     * dead gpu (on elk).
>> +     */
>> +    /* WaMediaResetMainRingCleanup:ctg,elk (supposedly) */
>
> Join this into a single comment block, s/supposedly/presumably/
>
> Just a small concern we have some duplication of stop_ring() here, but I
> don't have a better suggestion (along the lines of export intel_stop_ring,
> gen3_engine_stop_ring, so far looks more confusing than helpful). As
> you


I had a patch which piggypacked engine->reset_hw(engine, NULL) to do
the dirty work of stopping the ring. But the stop_ring of()
intel_ringbuffer.c was giving up halfway if it didn't find
idling the ring succesful, leaving head/tail intact.
And that was on the prepare reset path. The boon was that
it stopped the rings before killing the tasklet.

But I decided to do more surgical approach directy on top of
reset. If we find another gen which is suspectible, we might
want to either piggypack on reset_hw or do a engine->stop_ring()
and use it in prepare for reset path.

> have tested with DRM_ERROR to be sure that fear about this simply
> timing out for our spinning batches, it looks good to me.
>

Spinning batches in general seem to go idle nice, but gem_ringfill
will spew out that ring_stop timeout debug.

> Reviewed-by: Chris Wilson <[email protected]>

Thanks for review. Patch pushed.
-Mika
_______________________________________________
Intel-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] [PATCH v2 2/2] drm/i915/g4x: Improve gpu reset reliability

Reply via email to