RE: [PATCH 2/2] drm/i915: Support replaying GPU hangs with captured context image

2024-02-26 Thread Santa, Carlos

> -Original Message-
> From: Tvrtko Ursulin 
> Sent: Monday, February 26, 2024 1:56 AM
> To: Vivi, Rodrigo 
> Cc: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Ursulin,
> Tvrtko ; Landwerlin, Lionel G
> ; Santa, Carlos 
> Subject: Re: [PATCH 2/2] drm/i915: Support replaying GPU hangs with
> captured context image
> 
> 
> 
> On 22/02/2024 21:07, Rodrigo Vivi wrote:
> > On Wed, Feb 21, 2024 at 02:22:45PM +, Tvrtko Ursulin wrote:
> >> From: Tvrtko Ursulin 
> >>
> >> When debugging GPU hangs Mesa developers are finding it useful to
> >> replay the captured error state against the simulator. But due
> >> various simulator limitations which prevent replicating all hangs,
> >> one step further is being able to replay against a real GPU.
> >>
> >> This is almost doable today with the missing part being able to
> >> upload the captured context image into the driver state prior to
> >> executing the uploaded hanging batch and all the buffers.
> >>
> >> To enable this last part we add a new context parameter called
> >> I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU
> >> configuration pattern of being able to select which context to apply
> >> against, paired with the actual image and its size.
> >>
> >> Since this is adding a new concept of debug only uapi, we hide it
> >> behind a new kconfig option and also require activation with a module
> parameter.
> >> Together with a warning banner printed at driver load, all those
> >> combined should be sufficient to guard against inadvertently enabling the
> feature.
> >>
> >> In terms of implementation we allow the legacy context set param to
> >> be used since that removes the need to record the per context data in
> >> the proto context, while still allowing flexibility of specifying
> >> context images for any context.
> >>
> >> Mesa MR using the uapi can be seen at:
> >>https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594
> >>
> >> v2:
> >>   * Fix whitespace alignment as per checkpatch.
> >>   * Added warning on userspace misuse.
> >>   * Rebase for extracting ce->default_state shadowing.
> >>
> >> Signed-off-by: Tvrtko Ursulin 
> >> Cc: Lionel Landwerlin 
> >> Cc: Carlos Santa 
> >> Cc: Rodrigo Vivi 
> >> Reviewed-by: Rodrigo Vivi  # v1
> >
> > still valid for v2. Thanks for splitting the patch.
> 
> Great, thanks!
> 
> Now we need to hear from Lionel if he is still keen to have this. In which 
> case
> some acks or tested by would be good.
> 
> Regards,
> 
> Tvrtko

I tested a setup at my end and it reproduces the GPU hang. I am on a TGL based 
NUC system running Ubuntu 22.04.
There's a one-liner change that's needed in the corresponding Mesa but I've 
communicated that to Lionel already.

You can use my Tested-by: Carlos Santa 

> 
> >> ---
> >>   drivers/gpu/drm/i915/Kconfig.debug|  17 +++
> >>   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 113
> ++
> >>   drivers/gpu/drm/i915/gt/intel_context.c   |   2 +
> >>   drivers/gpu/drm/i915/gt/intel_context.h   |  22 
> >>   drivers/gpu/drm/i915/gt/intel_context_types.h |   1 +
> >>   drivers/gpu/drm/i915/gt/intel_lrc.c   |   3 +-
> >>   .../gpu/drm/i915/gt/intel_ring_submission.c   |   3 +-
> >>   drivers/gpu/drm/i915/i915_params.c|   5 +
> >>   drivers/gpu/drm/i915/i915_params.h|   3 +-
> >>   include/uapi/drm/i915_drm.h   |  27 +
> >>   10 files changed, 193 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/Kconfig.debug
> >> b/drivers/gpu/drm/i915/Kconfig.debug
> >> index 5b7162076850..32e9f70e91ed 100644
> >> --- a/drivers/gpu/drm/i915/Kconfig.debug
> >> +++ b/drivers/gpu/drm/i915/Kconfig.debug
> >> @@ -16,6 +16,23 @@ config DRM_I915_WERROR
> >>
> >>  If in doubt, say "N".
> >>
> >> +config DRM_I915_REPLAY_GPU_HANGS_API
> >> +  bool "Enable GPU hang replay userspace API"
> >> +  depends on DRM_I915
> >> +  depends on EXPERT
> >> +  default n
> >> +  help
> >> +Choose this option if you want to enable special and unstable
> >> +userspace API used for replaying GPU hangs on a running system.
> >> +
> >> +This API is intended to be used by userspace graphics stack

Re: [PATCH 2/2] drm/i915: Support replaying GPU hangs with captured context image

2024-02-26 Thread Tvrtko Ursulin




On 22/02/2024 21:07, Rodrigo Vivi wrote:

On Wed, Feb 21, 2024 at 02:22:45PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

When debugging GPU hangs Mesa developers are finding it useful to replay
the captured error state against the simulator. But due various simulator
limitations which prevent replicating all hangs, one step further is being
able to replay against a real GPU.

This is almost doable today with the missing part being able to upload the
captured context image into the driver state prior to executing the
uploaded hanging batch and all the buffers.

To enable this last part we add a new context parameter called
I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU
configuration pattern of being able to select which context to apply
against, paired with the actual image and its size.

Since this is adding a new concept of debug only uapi, we hide it behind
a new kconfig option and also require activation with a module parameter.
Together with a warning banner printed at driver load, all those combined
should be sufficient to guard against inadvertently enabling the feature.

In terms of implementation we allow the legacy context set param to be
used since that removes the need to record the per context data in the
proto context, while still allowing flexibility of specifying context
images for any context.

Mesa MR using the uapi can be seen at:
   https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594

v2:
  * Fix whitespace alignment as per checkpatch.
  * Added warning on userspace misuse.
  * Rebase for extracting ce->default_state shadowing.

Signed-off-by: Tvrtko Ursulin 
Cc: Lionel Landwerlin 
Cc: Carlos Santa 
Cc: Rodrigo Vivi 
Reviewed-by: Rodrigo Vivi  # v1


still valid for v2. Thanks for splitting the patch.


Great, thanks!

Now we need to hear from Lionel if he is still keen to have this. In 
which case some acks or tested by would be good.


Regards,

Tvrtko


---
  drivers/gpu/drm/i915/Kconfig.debug|  17 +++
  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 113 ++
  drivers/gpu/drm/i915/gt/intel_context.c   |   2 +
  drivers/gpu/drm/i915/gt/intel_context.h   |  22 
  drivers/gpu/drm/i915/gt/intel_context_types.h |   1 +
  drivers/gpu/drm/i915/gt/intel_lrc.c   |   3 +-
  .../gpu/drm/i915/gt/intel_ring_submission.c   |   3 +-
  drivers/gpu/drm/i915/i915_params.c|   5 +
  drivers/gpu/drm/i915/i915_params.h|   3 +-
  include/uapi/drm/i915_drm.h   |  27 +
  10 files changed, 193 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.debug 
b/drivers/gpu/drm/i915/Kconfig.debug
index 5b7162076850..32e9f70e91ed 100644
--- a/drivers/gpu/drm/i915/Kconfig.debug
+++ b/drivers/gpu/drm/i915/Kconfig.debug
@@ -16,6 +16,23 @@ config DRM_I915_WERROR
  
  	  If in doubt, say "N".
  
+config DRM_I915_REPLAY_GPU_HANGS_API

+   bool "Enable GPU hang replay userspace API"
+   depends on DRM_I915
+   depends on EXPERT
+   default n
+   help
+ Choose this option if you want to enable special and unstable
+ userspace API used for replaying GPU hangs on a running system.
+
+ This API is intended to be used by userspace graphics stack developers
+ and provides no stability guarantees.
+
+ The API needs to be activated at boot time using the
+ enable_debug_only_api module parameter.
+
+ If in doubt, say "N".
+
  config DRM_I915_DEBUG
bool "Enable additional driver debugging"
depends on DRM_I915
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..481aacbc1772 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -78,6 +78,7 @@
  #include "gt/intel_engine_user.h"
  #include "gt/intel_gpu_commands.h"
  #include "gt/intel_ring.h"
+#include "gt/shmem_utils.h"
  
  #include "pxp/intel_pxp.h"
  
@@ -949,6 +950,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,

case I915_CONTEXT_PARAM_NO_ZEROMAP:
case I915_CONTEXT_PARAM_BAN_PERIOD:
case I915_CONTEXT_PARAM_RINGSIZE:
+   case I915_CONTEXT_PARAM_CONTEXT_IMAGE:
default:
ret = -EINVAL;
break;
@@ -2092,6 +2094,95 @@ static int get_protected(struct i915_gem_context *ctx,
return 0;
  }
  
+static int set_context_image(struct i915_gem_context *ctx,

+struct drm_i915_gem_context_param *args)
+{
+   struct i915_gem_context_param_context_image user;
+   struct intel_context *ce;
+   struct file *shmem_state;
+   unsigned long lookup;
+   void *state;
+   int ret = 0;
+
+   if (!IS_ENABLED(CONFIG_DRM_I915_REPLAY_GPU_HANGS_API))
+   return -EINVAL;
+
+   if (!ctx->i915->params.enable_debug_only_api)
+   return -EINVAL;
+
+   if (args->size < 

Re: [PATCH 2/2] drm/i915: Support replaying GPU hangs with captured context image

2024-02-22 Thread Rodrigo Vivi
On Wed, Feb 21, 2024 at 02:22:45PM +, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> When debugging GPU hangs Mesa developers are finding it useful to replay
> the captured error state against the simulator. But due various simulator
> limitations which prevent replicating all hangs, one step further is being
> able to replay against a real GPU.
> 
> This is almost doable today with the missing part being able to upload the
> captured context image into the driver state prior to executing the
> uploaded hanging batch and all the buffers.
> 
> To enable this last part we add a new context parameter called
> I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU
> configuration pattern of being able to select which context to apply
> against, paired with the actual image and its size.
> 
> Since this is adding a new concept of debug only uapi, we hide it behind
> a new kconfig option and also require activation with a module parameter.
> Together with a warning banner printed at driver load, all those combined
> should be sufficient to guard against inadvertently enabling the feature.
> 
> In terms of implementation we allow the legacy context set param to be
> used since that removes the need to record the per context data in the
> proto context, while still allowing flexibility of specifying context
> images for any context.
> 
> Mesa MR using the uapi can be seen at:
>   https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594
> 
> v2:
>  * Fix whitespace alignment as per checkpatch.
>  * Added warning on userspace misuse.
>  * Rebase for extracting ce->default_state shadowing.
> 
> Signed-off-by: Tvrtko Ursulin 
> Cc: Lionel Landwerlin 
> Cc: Carlos Santa 
> Cc: Rodrigo Vivi 
> Reviewed-by: Rodrigo Vivi  # v1

still valid for v2. Thanks for splitting the patch.

> ---
>  drivers/gpu/drm/i915/Kconfig.debug|  17 +++
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 113 ++
>  drivers/gpu/drm/i915/gt/intel_context.c   |   2 +
>  drivers/gpu/drm/i915/gt/intel_context.h   |  22 
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   1 +
>  drivers/gpu/drm/i915/gt/intel_lrc.c   |   3 +-
>  .../gpu/drm/i915/gt/intel_ring_submission.c   |   3 +-
>  drivers/gpu/drm/i915/i915_params.c|   5 +
>  drivers/gpu/drm/i915/i915_params.h|   3 +-
>  include/uapi/drm/i915_drm.h   |  27 +
>  10 files changed, 193 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.debug 
> b/drivers/gpu/drm/i915/Kconfig.debug
> index 5b7162076850..32e9f70e91ed 100644
> --- a/drivers/gpu/drm/i915/Kconfig.debug
> +++ b/drivers/gpu/drm/i915/Kconfig.debug
> @@ -16,6 +16,23 @@ config DRM_I915_WERROR
>  
> If in doubt, say "N".
>  
> +config DRM_I915_REPLAY_GPU_HANGS_API
> + bool "Enable GPU hang replay userspace API"
> + depends on DRM_I915
> + depends on EXPERT
> + default n
> + help
> +   Choose this option if you want to enable special and unstable
> +   userspace API used for replaying GPU hangs on a running system.
> +
> +   This API is intended to be used by userspace graphics stack developers
> +   and provides no stability guarantees.
> +
> +   The API needs to be activated at boot time using the
> +   enable_debug_only_api module parameter.
> +
> +   If in doubt, say "N".
> +
>  config DRM_I915_DEBUG
>   bool "Enable additional driver debugging"
>   depends on DRM_I915
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
> b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index dcbfe32fd30c..481aacbc1772 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -78,6 +78,7 @@
>  #include "gt/intel_engine_user.h"
>  #include "gt/intel_gpu_commands.h"
>  #include "gt/intel_ring.h"
> +#include "gt/shmem_utils.h"
>  
>  #include "pxp/intel_pxp.h"
>  
> @@ -949,6 +950,7 @@ static int set_proto_ctx_param(struct 
> drm_i915_file_private *fpriv,
>   case I915_CONTEXT_PARAM_NO_ZEROMAP:
>   case I915_CONTEXT_PARAM_BAN_PERIOD:
>   case I915_CONTEXT_PARAM_RINGSIZE:
> + case I915_CONTEXT_PARAM_CONTEXT_IMAGE:
>   default:
>   ret = -EINVAL;
>   break;
> @@ -2092,6 +2094,95 @@ static int get_protected(struct i915_gem_context *ctx,
>   return 0;
>  }
>  
> +static int set_context_image(struct i915_gem_context *ctx,
> +  struct drm_i915_gem_context_param *args)
> +{
> + struct i915_gem_context_param_context_image user;
> + struct intel_context *ce;
> + struct file *shmem_state;
> + unsigned long lookup;
> + void *state;
> + int ret = 0;
> +
> + if (!IS_ENABLED(CONFIG_DRM_I915_REPLAY_GPU_HANGS_API))
> + return -EINVAL;
> +
> + if (!ctx->i915->params.enable_debug_only_api)
> + return -EINVAL;
> +
> + if (args->size < sizeof(user))
> + return -EINVAL;
> +

[PATCH 2/2] drm/i915: Support replaying GPU hangs with captured context image

2024-02-21 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

When debugging GPU hangs Mesa developers are finding it useful to replay
the captured error state against the simulator. But due various simulator
limitations which prevent replicating all hangs, one step further is being
able to replay against a real GPU.

This is almost doable today with the missing part being able to upload the
captured context image into the driver state prior to executing the
uploaded hanging batch and all the buffers.

To enable this last part we add a new context parameter called
I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU
configuration pattern of being able to select which context to apply
against, paired with the actual image and its size.

Since this is adding a new concept of debug only uapi, we hide it behind
a new kconfig option and also require activation with a module parameter.
Together with a warning banner printed at driver load, all those combined
should be sufficient to guard against inadvertently enabling the feature.

In terms of implementation we allow the legacy context set param to be
used since that removes the need to record the per context data in the
proto context, while still allowing flexibility of specifying context
images for any context.

Mesa MR using the uapi can be seen at:
  https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594

v2:
 * Fix whitespace alignment as per checkpatch.
 * Added warning on userspace misuse.
 * Rebase for extracting ce->default_state shadowing.

Signed-off-by: Tvrtko Ursulin 
Cc: Lionel Landwerlin 
Cc: Carlos Santa 
Cc: Rodrigo Vivi 
Reviewed-by: Rodrigo Vivi  # v1
---
 drivers/gpu/drm/i915/Kconfig.debug|  17 +++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 113 ++
 drivers/gpu/drm/i915/gt/intel_context.c   |   2 +
 drivers/gpu/drm/i915/gt/intel_context.h   |  22 
 drivers/gpu/drm/i915/gt/intel_context_types.h |   1 +
 drivers/gpu/drm/i915/gt/intel_lrc.c   |   3 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c   |   3 +-
 drivers/gpu/drm/i915/i915_params.c|   5 +
 drivers/gpu/drm/i915/i915_params.h|   3 +-
 include/uapi/drm/i915_drm.h   |  27 +
 10 files changed, 193 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.debug 
b/drivers/gpu/drm/i915/Kconfig.debug
index 5b7162076850..32e9f70e91ed 100644
--- a/drivers/gpu/drm/i915/Kconfig.debug
+++ b/drivers/gpu/drm/i915/Kconfig.debug
@@ -16,6 +16,23 @@ config DRM_I915_WERROR
 
  If in doubt, say "N".
 
+config DRM_I915_REPLAY_GPU_HANGS_API
+   bool "Enable GPU hang replay userspace API"
+   depends on DRM_I915
+   depends on EXPERT
+   default n
+   help
+ Choose this option if you want to enable special and unstable
+ userspace API used for replaying GPU hangs on a running system.
+
+ This API is intended to be used by userspace graphics stack developers
+ and provides no stability guarantees.
+
+ The API needs to be activated at boot time using the
+ enable_debug_only_api module parameter.
+
+ If in doubt, say "N".
+
 config DRM_I915_DEBUG
bool "Enable additional driver debugging"
depends on DRM_I915
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..481aacbc1772 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -78,6 +78,7 @@
 #include "gt/intel_engine_user.h"
 #include "gt/intel_gpu_commands.h"
 #include "gt/intel_ring.h"
+#include "gt/shmem_utils.h"
 
 #include "pxp/intel_pxp.h"
 
@@ -949,6 +950,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private 
*fpriv,
case I915_CONTEXT_PARAM_NO_ZEROMAP:
case I915_CONTEXT_PARAM_BAN_PERIOD:
case I915_CONTEXT_PARAM_RINGSIZE:
+   case I915_CONTEXT_PARAM_CONTEXT_IMAGE:
default:
ret = -EINVAL;
break;
@@ -2092,6 +2094,95 @@ static int get_protected(struct i915_gem_context *ctx,
return 0;
 }
 
+static int set_context_image(struct i915_gem_context *ctx,
+struct drm_i915_gem_context_param *args)
+{
+   struct i915_gem_context_param_context_image user;
+   struct intel_context *ce;
+   struct file *shmem_state;
+   unsigned long lookup;
+   void *state;
+   int ret = 0;
+
+   if (!IS_ENABLED(CONFIG_DRM_I915_REPLAY_GPU_HANGS_API))
+   return -EINVAL;
+
+   if (!ctx->i915->params.enable_debug_only_api)
+   return -EINVAL;
+
+   if (args->size < sizeof(user))
+   return -EINVAL;
+
+   if (copy_from_user(, u64_to_user_ptr(args->value), sizeof(user)))
+   return -EFAULT;
+
+   if (user.mbz)
+   return -EINVAL;
+
+   if (user.flags & ~(I915_CONTEXT_IMAGE_FLAG_ENGINE_INDEX))
+   return -EINVAL;
+
+   lookup = 0;
+   if