Re: [Intel-gfx] [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery

2017-04-25 Thread Chris Wilson
On Mon, Apr 24, 2017 at 02:22:21PM -0700, Michel Thierry wrote:
> 
> 
> On 20/04/17 17:17, Michel Thierry wrote:
> >>Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
> >>struct_mutex) to skip replaying innocent requests, but here we should be
> >>asserting that we do have the hung request.
> >>
> >>i.e.
> >>request = i915_gem_find_active_request(engine);
> >>if (!request)
> >>goto skip.
> >>
> >>Bonus points for tying that into i915_gem_reset_prepare_engine() so that
> >>we only seach for the active_request once.
> >>
> 
> Will this do it?
> https://patchwork.freedesktop.org/patch/152494/  (ignore the
> DRM_ERROR I still have to change)

That's a little uglier than I was hoping for :(
 
> I'm not sure about reusing the active request in full-reset (what if
> we have more than one engine hung?).

We can have more than one active request, just don't forget a
if (i915_gem_request_completed(engine->hangcheck.hung_request))
/* skip the reset! */
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery

2017-04-24 Thread Michel Thierry



On 20/04/17 17:17, Michel Thierry wrote:

Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
struct_mutex) to skip replaying innocent requests, but here we should be
asserting that we do have the hung request.

i.e.
request = i915_gem_find_active_request(engine);
if (!request)
goto skip.

Bonus points for tying that into i915_gem_reset_prepare_engine() so that
we only seach for the active_request once.



Will this do it?
https://patchwork.freedesktop.org/patch/152494/  (ignore the DRM_ERROR I 
still have to change)


I'm not sure about reusing the active request in full-reset (what if we 
have more than one engine hung?).


Thanks
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery

2017-04-20 Thread Michel Thierry

On 19/04/17 03:49, Chris Wilson wrote:

On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:

From: Arun Siluvery 

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangchecker determines which engines are hung and invokes error handler to
recover from it. Error handler schedules recovery for each of those engines
that are hung. The recovery procedure is as follows,
 - identifies the request that caused the hang and it is dropped
 - force engine to idle: this is done by issuing a reset request
 - reset and re-init engine
 - restart submissions to the engine

If engine reset fails then we fall back to heavy weight full gpu reset
which resets all engines and reinitiazes complete state of HW and SW.

v2: Rebase.
v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
calling i915_gem_reset_engine (Chris).
v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
reuse the function for reset_engine.
v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
v6: Clean up reset_engine function to not require mutex, i.e. no need to call
revoke/restore_fences and _retire_requests (Chris).

Cc: Chris Wilson 
Cc: Mika Kuoppala 
Signed-off-by: Tomas Elf 
Signed-off-by: Arun Siluvery 
Signed-off-by: Michel Thierry 
---
 drivers/gpu/drm/i915/i915_drv.c | 76 --
 drivers/gpu/drm/i915/i915_drv.h | 12 +++-
 drivers/gpu/drm/i915/i915_gem.c | 97 +++--
 drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
 drivers/gpu/drm/i915/intel_uncore.c | 20 +++
 5 files changed, 158 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index e03d0643dbd6..634893cd93b3 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)

pr_notice("drm/i915: Resetting chip after gpu hang\n");
disable_irq(dev_priv->drm.irq);
-   ret = i915_gem_reset_prepare(dev_priv);
+   ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
if (ret) {
DRM_ERROR("GPU recovery failed\n");
intel_gpu_reset(dev_priv, ALL_ENGINES);
@@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
i915_queue_hangcheck(dev_priv);

 finish:
-   i915_gem_reset_finish(dev_priv);
+   i915_gem_reset_finish(dev_priv, ALL_ENGINES);
enable_irq(dev_priv->drm.irq);

 wakeup:
@@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - force engine to idle: this is done by issuing a reset request
+ *  - reset engine
+ *  - restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
-   /* FIXME: replace me with engine reset sequence */
-   return -ENODEV;
+   int ret;
+   struct drm_i915_private *dev_priv = engine->i915;
+   struct i915_gpu_error *error = _priv->gpu_error;
+
+   GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, >flags));
+
+   DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
+
+   /*
+* We need to first idle the engine by issuing a reset request,
+* then perform soft reset and re-initialize hw state, for all of
+* this GT power need to be awake so ensure it does throughout the
+* process
+*/
+   intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);


Hmm, what path did we take to get here without taking rpm? I thought I
had fixed the last offender.



Too many rebases... As you say, this is no longer needed after 
1604a86d08053 "drm/i915: Extend rpm wakelock during i915_handle_error()"



+   disable_irq(dev_priv->drm.irq);


I am 99% certain that we don't need to disable_irq() now for per-engine
reset... I'd keep it in the global reset as simple paranoia.



100% correct.


+   ret = i915_gem_reset_prepare_engine(engine);
+   if (ret) {
+   DRM_ERROR("Previous reset failed - promote to full reset\n");
+   goto error;
+   }
+
+   /*
+* the request that caused the hang is stuck on elsp, identify the
+* active request and drop it, adjust head to skip the offending
+* request to resume executing remaining requests in the queue.
+*/


Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
struct_mutex) to skip replaying innocent 

Re: [Intel-gfx] [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery

2017-04-19 Thread Chris Wilson
On Wed, Apr 19, 2017 at 11:49:26AM +0100, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:
> > +   ret = i915_gem_reset_prepare_engine(engine);
> > +   if (ret) {
> > +   DRM_ERROR("Previous reset failed - promote to full reset\n");
> > +   goto error;
> > +   }
> > +
> > +   /*
> > +* the request that caused the hang is stuck on elsp, identify the
> > +* active request and drop it, adjust head to skip the offending
> > +* request to resume executing remaining requests in the queue.
> > +*/
> 
> Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
> struct_mutex) to skip replaying innocent requests, but here we should be
> asserting that we do have the hung request.
> 
> i.e.
> request = i915_gem_find_active_request(engine);
> if (!request)
>   goto skip.

Forgot to mention this should include a "pardoned" check, i.e. that the
active request still matches the watchdog seqno.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery

2017-04-19 Thread Chris Wilson
On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:
> From: Arun Siluvery 
> 
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
> 
> Hangchecker determines which engines are hung and invokes error handler to
> recover from it. Error handler schedules recovery for each of those engines
> that are hung. The recovery procedure is as follows,
>  - identifies the request that caused the hang and it is dropped
>  - force engine to idle: this is done by issuing a reset request
>  - reset and re-init engine
>  - restart submissions to the engine
> 
> If engine reset fails then we fall back to heavy weight full gpu reset
> which resets all engines and reinitiazes complete state of HW and SW.
> 
> v2: Rebase.
> v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
> calling i915_gem_reset_engine (Chris).
> v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
> reuse the function for reset_engine.
> v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
> v6: Clean up reset_engine function to not require mutex, i.e. no need to call
> revoke/restore_fences and _retire_requests (Chris).
> 
> Cc: Chris Wilson 
> Cc: Mika Kuoppala 
> Signed-off-by: Tomas Elf 
> Signed-off-by: Arun Siluvery 
> Signed-off-by: Michel Thierry 
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 76 --
>  drivers/gpu/drm/i915/i915_drv.h | 12 +++-
>  drivers/gpu/drm/i915/i915_gem.c | 97 
> +++--
>  drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
>  drivers/gpu/drm/i915/intel_uncore.c | 20 +++
>  5 files changed, 158 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index e03d0643dbd6..634893cd93b3 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  
>   pr_notice("drm/i915: Resetting chip after gpu hang\n");
>   disable_irq(dev_priv->drm.irq);
> - ret = i915_gem_reset_prepare(dev_priv);
> + ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
>   if (ret) {
>   DRM_ERROR("GPU recovery failed\n");
>   intel_gpu_reset(dev_priv, ALL_ENGINES);
> @@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>   i915_queue_hangcheck(dev_priv);
>  
>  finish:
> - i915_gem_reset_finish(dev_priv);
> + i915_gem_reset_finish(dev_priv, ALL_ENGINES);
>   enable_irq(dev_priv->drm.irq);
>  
>  wakeup:
> @@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
>   *
>   * Reset a specific GPU engine. Useful if a hang is detected.
>   * Returns zero on successful reset or otherwise an error code.
> + *
> + * Procedure is:
> + *  - identifies the request that caused the hang and it is dropped
> + *  - force engine to idle: this is done by issuing a reset request
> + *  - reset engine
> + *  - restart submissions to the engine
>   */
>  int i915_reset_engine(struct intel_engine_cs *engine)
>  {
> - /* FIXME: replace me with engine reset sequence */
> - return -ENODEV;
> + int ret;
> + struct drm_i915_private *dev_priv = engine->i915;
> + struct i915_gpu_error *error = _priv->gpu_error;
> +
> + GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, >flags));
> +
> + DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
> +
> + /*
> +  * We need to first idle the engine by issuing a reset request,
> +  * then perform soft reset and re-initialize hw state, for all of
> +  * this GT power need to be awake so ensure it does throughout the
> +  * process
> +  */
> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);

Hmm, what path did we take to get here without taking rpm? I thought I
had fixed the last offender.

> + disable_irq(dev_priv->drm.irq);

I am 99% certain that we don't need to disable_irq() now for per-engine
reset... I'd keep it in the global reset as simple paranoia.

> + ret = i915_gem_reset_prepare_engine(engine);
> + if (ret) {
> + DRM_ERROR("Previous reset failed - promote to full reset\n");
> + goto error;
> + }
> +
> + /*
> +  * the request that caused the hang is stuck on elsp, identify the
> +  * active request and drop it, adjust head to skip the offending
> +  * request to resume executing remaining requests in the queue.
> +  */

Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
struct_mutex) to skip replaying innocent requests, but here we should be
asserting that we do have the 

[Intel-gfx] [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery

2017-04-18 Thread Michel Thierry
From: Arun Siluvery 

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangchecker determines which engines are hung and invokes error handler to
recover from it. Error handler schedules recovery for each of those engines
that are hung. The recovery procedure is as follows,
 - identifies the request that caused the hang and it is dropped
 - force engine to idle: this is done by issuing a reset request
 - reset and re-init engine
 - restart submissions to the engine

If engine reset fails then we fall back to heavy weight full gpu reset
which resets all engines and reinitiazes complete state of HW and SW.

v2: Rebase.
v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
calling i915_gem_reset_engine (Chris).
v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
reuse the function for reset_engine.
v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
v6: Clean up reset_engine function to not require mutex, i.e. no need to call
revoke/restore_fences and _retire_requests (Chris).

Cc: Chris Wilson 
Cc: Mika Kuoppala 
Signed-off-by: Tomas Elf 
Signed-off-by: Arun Siluvery 
Signed-off-by: Michel Thierry 
---
 drivers/gpu/drm/i915/i915_drv.c | 76 --
 drivers/gpu/drm/i915/i915_drv.h | 12 +++-
 drivers/gpu/drm/i915/i915_gem.c | 97 +++--
 drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
 drivers/gpu/drm/i915/intel_uncore.c | 20 +++
 5 files changed, 158 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index e03d0643dbd6..634893cd93b3 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
 
pr_notice("drm/i915: Resetting chip after gpu hang\n");
disable_irq(dev_priv->drm.irq);
-   ret = i915_gem_reset_prepare(dev_priv);
+   ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
if (ret) {
DRM_ERROR("GPU recovery failed\n");
intel_gpu_reset(dev_priv, ALL_ENGINES);
@@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
i915_queue_hangcheck(dev_priv);
 
 finish:
-   i915_gem_reset_finish(dev_priv);
+   i915_gem_reset_finish(dev_priv, ALL_ENGINES);
enable_irq(dev_priv->drm.irq);
 
 wakeup:
@@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - force engine to idle: this is done by issuing a reset request
+ *  - reset engine
+ *  - restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
-   /* FIXME: replace me with engine reset sequence */
-   return -ENODEV;
+   int ret;
+   struct drm_i915_private *dev_priv = engine->i915;
+   struct i915_gpu_error *error = _priv->gpu_error;
+
+   GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, >flags));
+
+   DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
+
+   /*
+* We need to first idle the engine by issuing a reset request,
+* then perform soft reset and re-initialize hw state, for all of
+* this GT power need to be awake so ensure it does throughout the
+* process
+*/
+   intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+   disable_irq(dev_priv->drm.irq);
+   ret = i915_gem_reset_prepare_engine(engine);
+   if (ret) {
+   DRM_ERROR("Previous reset failed - promote to full reset\n");
+   goto error;
+   }
+
+   /*
+* the request that caused the hang is stuck on elsp, identify the
+* active request and drop it, adjust head to skip the offending
+* request to resume executing remaining requests in the queue.
+*/
+   i915_gem_reset_engine(engine);
+
+   /* forcing engine to idle */
+   ret = intel_reset_engine_start(engine);
+   if (ret) {
+   DRM_ERROR("Failed to disable %s\n", engine->name);
+   goto error;
+   }
+
+   /* finally, reset engine */
+   ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+   if (ret) {
+   DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+   intel_reset_engine_cancel(engine);
+   goto error;
+   }
+
+   /* be sure the request reset bit gets cleared */
+