Re: [PATCH] drm/i915: Set wedged if enable guc communication failed

2023-03-03 Thread Ceraolo Spurio, Daniele




On 3/2/2023 1:50 PM, Zhanjun Dong wrote:

Add err code check for enable_communication on resume path. When resume failed, 
we can no longer use the GPU, marking the GPU as wedged.


This is slightly incorrect. If we fail to enable communication, the 
consequence is that we can't use the GuC. That translates to a failure 
to use the GT only if GuC submission is enabled; in execlists submission 
mode we can keep going, although we might end up losing HuC 
functionality (which is not considered a fatal error). Therefore, the 
code below should be updated to reflect this.




Signed-off-by: Zhanjun Dong 
---
  drivers/gpu/drm/i915/gt/intel_gt_pm.c | 7 ++-
  drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++--
  2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index cef3d6f5c34e..42a7d75ce39e 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -401,8 +401,13 @@ int intel_gt_runtime_resume(struct intel_gt *gt)
intel_ggtt_restore_fences(gt->ggtt);
  
  	ret = intel_uc_runtime_resume(>uc);

-   if (ret)
+   if (ret) {
+   /* Resume failed, we can no longer use the GPU, marking the GPU
+* as wedged.
+*/
+   intel_gt_set_wedged(gt);


intel_gt_set_wedged calls intel_runtime_pm_get, so it will deadlock if 
you call if from within the runtime_resume flow.



return ret;
+   }
  
  	return 0;

  }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c 
b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
index 6648691bd645..d4f428acf20a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
@@ -698,8 +698,13 @@ static int __uc_resume(struct intel_uc *uc, bool 
enable_communication)
/* Make sure we enable communication if and only if it's disabled */
GEM_BUG_ON(enable_communication == intel_guc_ct_enabled(>ct));
  
-	if (enable_communication)

-   guc_enable_communication(guc);
+   if (enable_communication) {
+   err = guc_enable_communication(guc);
+   if (err) {
+   guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));


nit: this isn't exactly a failure to resume because the GuC is running. 
It's more a failure to re-establish communication with the GuC.


Daniele


+   return err;
+   }
+   }
  
  	/* If we are only resuming GuC communication but not reloading

 * GuC, we need to ensure the ARAT timer interrupt is enabled




RE: [PATCH] drm/i915: Set wedged if enable guc communication failed

2023-03-02 Thread Dong, Zhanjun
Thanks Jani.
Updated patch sent, let me know if you have any comments.

Regards,
Zhanjun

> -Original Message-
> From: Jani Nikula 
> Sent: February 27, 2023 6:30 AM
> To: Dong, Zhanjun ; intel-
> g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Cc: Dong, Zhanjun 
> Subject: Re: [PATCH] drm/i915: Set wedged if enable guc communication
> failed
> 
> On Fri, 24 Feb 2023, Zhanjun Dong  wrote:
> > Add err code check for enable_communication on resume path, set
> wedged if failed.
> 
> I can see that this is *what* the code does, but the commit message should
> answer the question *why*.
> 
> >
> > Signed-off-by: Zhanjun Dong 
> > ---
> >  drivers/gpu/drm/i915/gt/intel_gt_pm.c | 5 -
> > drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++--
> >  2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > index cef3d6f5c34e..f3bb7cbbd293 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > @@ -401,8 +401,11 @@ int intel_gt_runtime_resume(struct intel_gt *gt)
> > intel_ggtt_restore_fences(gt->ggtt);
> >
> > ret = intel_uc_runtime_resume(>uc);
> > -   if (ret)
> > +   if (ret) {
> > +   /* Set wedge if uc resume failed */
> 
> This comment is just a reiteration of the C code in English, but doesn't
> provide any useful additional information.
> 
> BR,
> Jani.
> 
> > +   intel_gt_set_wedged(gt);
> > return ret;
> > +   }
> >
> > return 0;
> >  }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > index 6648691bd645..d4f428acf20a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > @@ -698,8 +698,13 @@ static int __uc_resume(struct intel_uc *uc, bool
> enable_communication)
> > /* Make sure we enable communication if and only if it's disabled */
> > GEM_BUG_ON(enable_communication ==
> intel_guc_ct_enabled(>ct));
> >
> > -   if (enable_communication)
> > -   guc_enable_communication(guc);
> > +   if (enable_communication) {
> > +   err = guc_enable_communication(guc);
> > +   if (err) {
> > +   guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));
> > +   return err;
> > +   }
> > +   }
> >
> > /* If we are only resuming GuC communication but not reloading
> >  * GuC, we need to ensure the ARAT timer interrupt is enabled
> 
> --
> Jani Nikula, Intel Open Source Graphics Center


Re: [PATCH] drm/i915: Set wedged if enable guc communication failed

2023-02-27 Thread Jani Nikula
On Fri, 24 Feb 2023, Zhanjun Dong  wrote:
> Add err code check for enable_communication on resume path, set wedged if 
> failed.

I can see that this is *what* the code does, but the commit message
should answer the question *why*.

>
> Signed-off-by: Zhanjun Dong 
> ---
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c | 5 -
>  drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++--
>  2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
> b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index cef3d6f5c34e..f3bb7cbbd293 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -401,8 +401,11 @@ int intel_gt_runtime_resume(struct intel_gt *gt)
>   intel_ggtt_restore_fences(gt->ggtt);
>  
>   ret = intel_uc_runtime_resume(>uc);
> - if (ret)
> + if (ret) {
> + /* Set wedge if uc resume failed */

This comment is just a reiteration of the C code in English, but doesn't
provide any useful additional information.

BR,
Jani.

> + intel_gt_set_wedged(gt);
>   return ret;
> + }
>  
>   return 0;
>  }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index 6648691bd645..d4f428acf20a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -698,8 +698,13 @@ static int __uc_resume(struct intel_uc *uc, bool 
> enable_communication)
>   /* Make sure we enable communication if and only if it's disabled */
>   GEM_BUG_ON(enable_communication == intel_guc_ct_enabled(>ct));
>  
> - if (enable_communication)
> - guc_enable_communication(guc);
> + if (enable_communication) {
> + err = guc_enable_communication(guc);
> + if (err) {
> + guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));
> + return err;
> + }
> + }
>  
>   /* If we are only resuming GuC communication but not reloading
>* GuC, we need to ensure the ARAT timer interrupt is enabled

-- 
Jani Nikula, Intel Open Source Graphics Center