Re: [Intel-gfx] [PATCH] drm/i915: Rework GPU reset sequence to match driver load & thaw

Daniel Vetter Mon, 04 Aug 2014 00:53:03 -0700

On Thu, Jul 31, 2014 at 04:37:14PM +0000, Mcaulay, Alistair wrote:
> Hi Daniel,
> 
> Something more like this then? (And revert the change to intel_ring_begin(),
> putting it back to how it was.)


Yeah, roughly. Except that I would place the reload_in_reset wrapping in
the i915_reset function. It is paramount that we never leak this outside
of the dev->struct_mutex protection so that other threads can't ever
observe this to be set. So putting it right next to the mutex locking is
better.

Also I think you've wrapped the wrong function: the re-init is done in
i915_gem_init_hw. i915_gem_reset here just resets the software state
(mostly) and runs before the actual gpu hw reset. i915_gem_init_hw is only
run if the reset succeeds.
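
Roughly, as an untested standalone model (not actual driver code). Every
model_* name below is made up for illustration, a plain bool stands in for
dev->struct_mutex, and -ENODEV is just a placeholder for a failed hw reset:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for struct i915_gpu_error; not the real layout. */
struct model_gpu_error {
	bool reset_in_progress;  /* models the RESET_IN_PROGRESS bit */
	bool terminally_wedged;  /* models the WEDGED bit */
	bool reload_in_reset;    /* must only be true while the lock is held */
};

struct model_dev {
	bool struct_mutex_held;  /* stand-in for dev->struct_mutex */
	struct model_gpu_error gpu_error;
};

/* Models the wedge check: -EAGAIN is suppressed only during the reload. */
int model_check_wedge(const struct model_gpu_error *error)
{
	if (error->reset_in_progress) {
		if (error->terminally_wedged)
			return -EIO;

		/* Everyone except the reset path itself must still see -EAGAIN. */
		if (!error->reload_in_reset)
			return -EAGAIN;
	}

	return 0;
}

/* Models the hw re-init, which needs the rings and so hits the wedge check. */
int model_init_hw(struct model_dev *dev)
{
	return model_check_wedge(&dev->gpu_error);
}

/* Models the reset function: the flag never escapes the locked section. */
int model_reset(struct model_dev *dev, bool hw_reset_ok)
{
	int ret = -ENODEV; /* placeholder error for a failed hw reset */

	dev->struct_mutex_held = true;  /* mutex_lock(&dev->struct_mutex) */

	/* Software state cleanup and the actual hw reset would happen here. */

	if (hw_reset_ok) {
		assert(!dev->gpu_error.reload_in_reset);
		dev->gpu_error.reload_in_reset = true;
		ret = model_init_hw(dev);
		dev->gpu_error.reload_in_reset = false;
	}

	dev->struct_mutex_held = false; /* mutex_unlock(&dev->struct_mutex) */
	return ret;
}
```

The point is that the flag is set and cleared inside the same locked
section, and only around the re-init, so a failed hw reset never touches it.
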
-Daniel

> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 991b663..b811ff2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1217,6 +1217,9 @@ struct i915_gpu_error {
>  
>       /* For missed irq/seqno simulation. */
>       unsigned int test_irq_rings;
> +
> +     /* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset */
> +     bool reload_in_reset;
>  };
>  
>  enum modeset_restore {
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index b38e086..a25d3b5 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1085,7 +1085,9 @@ i915_gem_check_wedge(struct i915_gpu_error *error,
>               if (i915_terminally_wedged(error))
>                       return -EIO;
>  
> -             return -EAGAIN;
> +             /* Check if GPU Reset is in progress */
> +             if (!error->reload_in_reset)
> +                     return -EAGAIN;
>       }
>  
>       return 0;
> @@ -2579,6 +2581,8 @@ void i915_gem_reset(struct drm_device *dev)
>       struct intel_engine_cs *ring;
>       int i;
>  
> +     /* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset */
> +     dev_priv->gpu_error.reload_in_reset = true;
>       /*
>        * Before we free the objects from the requests, we need to inspect
>        * them for finding the guilty party. As the requests only borrow
> @@ -2591,6 +2595,8 @@ void i915_gem_reset(struct drm_device *dev)
>               i915_gem_reset_ring_cleanup(dev_priv, ring);
>  
>       i915_gem_restore_fences(dev);
> +
> +     dev_priv->gpu_error.reload_in_reset = false;
>  }
> 
> 
> -----Original Message-----
> From: Daniel Vetter [mailto:[email protected]] On Behalf Of Daniel Vetter
> Sent: Wednesday, July 30, 2014 10:01 PM
> To: Mcaulay, Alistair
> Cc: Daniel Vetter; Chris Wilson; Ben Widawsky; [email protected]
> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Rework GPU reset sequence to match 
> driver load & thaw
> 
> On Wed, Jul 30, 2014 at 04:59:33PM +0000, Mcaulay, Alistair wrote:
> > Hi Daniel,
> > 
> > could you please be clearer on the change you mean.  I think you mean 
> > something functionally equivalent to the code below, but done in a less 
> > hacky way.
> > (This slight change has made no change to test results) Or is the idea 
> > to return at a different point to this?
> > I couldn't find " dev_priv->mm.reload_in_reset or similar" in the 
> > code. The only thing I can find is error->reset_counter, which is used 
> > in check_wedge(). Bottom bit set means RESET_IN_PROGRESS, top bit 
> > means WEDGED
> 
> Well, I meant that you have to add a new dev_priv->mm.reload_in_reset.
> And the below won't work since in all other places but when doing a gpu reset 
> we want the -EAGAIN to reach callers. Actually it's really important that if 
> we have an -EAGAIN we don't eat it.
> 
> And I guess the check for mm.reload_in_reset should actually be in 
> gem_check_wedged.
> -Daniel
> 
> > 
> > 
> >  --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> >  +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >  @@ -1832,7 +1832,9 @@ int intel_ring_begin(struct intel_engine_cs *ring,
> >   
> >     ret = i915_gem_check_wedge(&dev_priv->gpu_error,
> >                                dev_priv->mm.interruptible);
> >  -  if (ret)
> >  +
> >  +  /* -EAGAIN means a reset is in progress, it is Ok to return */
> >  +  if (ret == -EAGAIN)
> >  +          return 0;
> >  +  if (ret)
> >  +          return ret;
> >   
> >     ret = __intel_ring_prepare(ring, num_dwords * sizeof(uint32_t));
> > 
> > Alistair.
> > 
> > -----Original Message-----
> > From: Intel-gfx [mailto:[email protected]] On 
> > Behalf Of Daniel Vetter
> > Sent: Tuesday, July 29, 2014 11:33 AM
> > To: Chris Wilson; Daniel Vetter; Ben Widawsky; 
> > [email protected]
> > Subject: Re: [Intel-gfx] [PATCH] drm/i915: Rework GPU reset sequence 
> > to match driver load & thaw
> > 
> > On Tue, Jul 29, 2014 at 08:36:33AM +0100, Chris Wilson wrote:
> > > On Mon, Jul 28, 2014 at 11:26:38AM +0200, Daniel Vetter wrote:
> > > > Oh, I guess that's the tricky bit why the old approach never
> > > > worked - because reset_in_progress is set we failed the
> > > > context/ppgtt loading through the rings and screwed up.
> > > > 
> > > > Problem with your approach is that we want to bail out here if a 
> > > > reset is in progress, so we can't just eat the EAGAIN. If we do 
> > > > that we potentially deadlock or overflow the ring.
> > > > 
> > > > I think we need a different hack here, and a few layers down (i.e. 
> > > > at the place where we actually generate that offending -EAGAIN).
> > > > 
> > > > - Around the re-init sequence in the reset function we set
> > > >   dev_priv->mm.reload_in_reset or similar. Since we hold dev->struct_mutex
> > > >   no one will see that, as long as we never leak it out of the critical
> > > >   section.
> > > > 
> > > > - In the ring_begin code that checks for gpu hangs we ignore
> > > >   reset_in_progress if this bit is set.
> > > > 
> > > > - Both places need fairly big comments to explain what exactly is going
> > > >   on.
> > > 
> > > This is going from bad to worse. I think you can do better if you 
> > > looked at the problem afresh.
> > 
> > Well we can't really clear reset_in_progress at that point, since not all 
> > of the reset is done yet, especially the modeset stuff. So I don't think that 
> > reordering the reset sequence would get us out of this ugly spot. And I 
> > don't see any other solution really. Do you?
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> > _______________________________________________
> > Intel-gfx mailing list
> > [email protected]
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
