Re: [PATCH] drm/i915/guc: Fix the fix for reset lock confusion

2024-04-03 Thread Andi Shyti
Hi John,

On Fri, Mar 29, 2024 at 04:53:05PM -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> The previous fix for the circular lock splat about the busyness
> worker wasn't quite complete. Even though the reset-in-progress flag
> is cleared at the start of intel_uc_reset_finish, the entire function
> is still inside the reset mutex lock. Not sure why the patch appeared
> to fix the issue both locally and in CI. However, it is now back
> again.
> 
> There is a further complication: the wedge code path within
> intel_gt_reset() jumps around so much it results in nested
> reset_prepare/_finish calls. That is, the call sequence is:
>   intel_gt_reset
>   | reset_prepare
>   | __intel_gt_set_wedged
>   | | reset_prepare
>   | | reset_finish
>   | reset_finish
> 
> The nested finish means that even if the clear of the in-progress flag
> was moved to the end of _finish, it would still be clear for the
> entire second call. Surprisingly, this does not seem to be causing any
> other problems at present.
> 
> As an aside, a wedge on fini does not call the finish functions at
> all. The reset_in_progress flag is left set (twice).
> 
> So instead of trying to cancel the worker anywhere at all in the reset
> path, just add a cancel to intel_guc_submission_fini instead. Note
> that it is not a problem if the worker is still active during a reset.
> Either it will run before the reset path starts locking things and
> will simply block the reset code for a tiny amount of time. Or it will
> run after the locks have been acquired and will early exit due to the
> try-lock.
> 
> Also, do not use the reset-in-progress flag to decide whether a
> synchronous cancel is safe (from a lockdep perspective) or not.
> Instead, use the actual reset mutex state (both the genuine one and
> the custom rolled BACKOFF one).
> 
> Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush")
> Signed-off-by: John Harrison 
> Cc: Zhanjun Dong 
> Cc: John Harrison 
> Cc: Andi Shyti 
> Cc: Daniel Vetter 
> Cc: Daniel Vetter 
> Cc: Rodrigo Vivi 
> Cc: Nirmoy Das 
> Cc: Tvrtko Ursulin 
> Cc: Umesh Nerlige Ramappa 
> Cc: Andrzej Hajda 
> Cc: Matt Roper 
> Cc: Jonathan Cavitt 
> Cc: Prathap Kumar Valsan 
> Cc: Alan Previn 
> Cc: Madhumitha Tolakanahalli Pradeep 
> 
> Cc: Daniele Ceraolo Spurio 
> Cc: Ashutosh Dixit 
> Cc: Dnyaneshwar Bhadane 

Looks good:

Reviewed-by: Andi Shyti 

Thanks,
Andi


Re: [PATCH] drm/i915/guc: Fix the fix for reset lock confusion

2024-04-03 Thread Nirmoy Das



On 3/30/2024 12:53 AM, john.c.harri...@intel.com wrote:

From: John Harrison 

The previous fix for the circular lock splat about the busyness
worker wasn't quite complete. Even though the reset-in-progress flag
is cleared at the start of intel_uc_reset_finish, the entire function
is still inside the reset mutex lock. Not sure why the patch appeared
to fix the issue both locally and in CI. However, it is now back
again.

There is a further complication: the wedge code path within
intel_gt_reset() jumps around so much it results in nested
reset_prepare/_finish calls. That is, the call sequence is:
   intel_gt_reset
   | reset_prepare
   | __intel_gt_set_wedged
   | | reset_prepare
   | | reset_finish
   | reset_finish

The nested finish means that even if the clear of the in-progress flag
was moved to the end of _finish, it would still be clear for the
entire second call. Surprisingly, this does not seem to be causing any
other problems at present.

As an aside, a wedge on fini does not call the finish functions at
all. The reset_in_progress flag is left set (twice).

So instead of trying to cancel the worker anywhere at all in the reset
path, just add a cancel to intel_guc_submission_fini instead. Note
that it is not a problem if the worker is still active during a reset.
Either it will run before the reset path starts locking things and
will simply block the reset code for a tiny amount of time. Or it will
run after the locks have been acquired and will early exit due to the
try-lock.

Also, do not use the reset-in-progress flag to decide whether a
synchronous cancel is safe (from a lockdep perspective) or not.
Instead, use the actual reset mutex state (both the genuine one and
the custom rolled BACKOFF one).

Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush")
Signed-off-by: John Harrison 
Cc: Zhanjun Dong 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Daniel Vetter 
Cc: Daniel Vetter 
Cc: Rodrigo Vivi 
Cc: Nirmoy Das 
Cc: Tvrtko Ursulin 
Cc: Umesh Nerlige Ramappa 
Cc: Andrzej Hajda 
Cc: Matt Roper 
Cc: Jonathan Cavitt 
Cc: Prathap Kumar Valsan 
Cc: Alan Previn 
Cc: Madhumitha Tolakanahalli Pradeep 

Cc: Daniele Ceraolo Spurio 
Cc: Ashutosh Dixit 
Cc: Dnyaneshwar Bhadane 


Thanks for the details, looks good to me:

Reviewed-by: Nirmoy Das 


---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++--------------
  drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  4 ++++
  2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 16640d6dd0589..00757d6333e88 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1403,14 +1403,17 @@ static void guc_cancel_busyness_worker(struct intel_guc *guc)
 	 * Trying to pass a 'need_sync' or 'in_reset' flag all the way down through
 	 * every possible call stack is unfeasible. It would be too intrusive to many
 	 * areas that really don't care about the GuC backend. However, there is the
-	 * 'reset_in_progress' flag available, so just use that.
+	 * I915_RESET_BACKOFF flag and the gt->reset.mutex can be tested for is_locked.
+	 * So just use those. Note that testing both is required due to the hideously
+	 * complex nature of the i915 driver's reset code paths.
 	 *
 	 * And note that in the case of a reset occurring during driver unload
-	 * (wedge_on_fini), skipping the cancel in _prepare (when the reset flag is set
-	 * is fine because there is another cancel in _finish (when the reset flag is
-	 * not).
+	 * (wedged_on_fini), skipping the cancel in reset_prepare/reset_fini (when the
+	 * reset flag/mutex are set) is fine because there is another explicit cancel in
+	 * intel_guc_submission_fini (when the reset flag/mutex are not).
 	 */
-	if (guc_to_gt(guc)->uc.reset_in_progress)
+	if (mutex_is_locked(&guc_to_gt(guc)->reset.mutex) ||
+	    test_bit(I915_RESET_BACKOFF, &guc_to_gt(guc)->reset.flags))
 		cancel_delayed_work(&guc->timestamp.work);
 	else
 		cancel_delayed_work_sync(&guc->timestamp.work);
@@ -1424,8 +1427,6 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc)
 	unsigned long flags;
 	ktime_t unused;
 
-	guc_cancel_busyness_worker(guc);
-
 	spin_lock_irqsave(&guc->timestamp.lock, flags);
 
 	guc_update_pm_timestamp(guc, &unused);

@@ -2004,13 +2005,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 
 void intel_guc_submission_reset_finish(struct intel_guc *guc)
 {
-	/*
-	 * Ensure the busyness worker gets cancelled even on a fatal wedge.
-	 * Note that reset_prepare is not allowed to because it confuses lockdep.
-	 */
-	if (guc_submission_initialized(guc))
-		guc_cancel_busyness_worker(guc);
-
 	/* Reset called during driver load or during wedge? */

[PATCH] drm/i915/guc: Fix the fix for reset lock confusion

2024-03-29 Thread John . C . Harrison
From: John Harrison 

The previous fix for the circular lock splat about the busyness
worker wasn't quite complete. Even though the reset-in-progress flag
is cleared at the start of intel_uc_reset_finish, the entire function
is still inside the reset mutex lock. Not sure why the patch appeared
to fix the issue both locally and in CI. However, it is now back
again.

There is a further complication: the wedge code path within
intel_gt_reset() jumps around so much it results in nested
reset_prepare/_finish calls. That is, the call sequence is:
  intel_gt_reset
  | reset_prepare
  | __intel_gt_set_wedged
  | | reset_prepare
  | | reset_finish
  | reset_finish

The nested finish means that even if the clear of the in-progress flag
was moved to the end of _finish, it would still be clear for the
entire second call. Surprisingly, this does not seem to be causing any
other problems at present.

As an aside, a wedge on fini does not call the finish functions at
all. The reset_in_progress flag is left set (twice).

So instead of trying to cancel the worker anywhere at all in the reset
path, just add a cancel to intel_guc_submission_fini instead. Note
that it is not a problem if the worker is still active during a reset.
Either it will run before the reset path starts locking things and
will simply block the reset code for a tiny amount of time. Or it will
run after the locks have been acquired and will early exit due to the
try-lock.

Also, do not use the reset-in-progress flag to decide whether a
synchronous cancel is safe (from a lockdep perspective) or not.
Instead, use the actual reset mutex state (both the genuine one and
the custom rolled BACKOFF one).

Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush")
Signed-off-by: John Harrison 
Cc: Zhanjun Dong 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Daniel Vetter 
Cc: Daniel Vetter 
Cc: Rodrigo Vivi 
Cc: Nirmoy Das 
Cc: Tvrtko Ursulin 
Cc: Umesh Nerlige Ramappa 
Cc: Andrzej Hajda 
Cc: Matt Roper 
Cc: Jonathan Cavitt 
Cc: Prathap Kumar Valsan 
Cc: Alan Previn 
Cc: Madhumitha Tolakanahalli Pradeep 

Cc: Daniele Ceraolo Spurio 
Cc: Ashutosh Dixit 
Cc: Dnyaneshwar Bhadane 
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++--------------
 drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  4 ++++
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 16640d6dd0589..00757d6333e88 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1403,14 +1403,17 @@ static void guc_cancel_busyness_worker(struct intel_guc *guc)
 	 * Trying to pass a 'need_sync' or 'in_reset' flag all the way down through
 	 * every possible call stack is unfeasible. It would be too intrusive to many
 	 * areas that really don't care about the GuC backend. However, there is the
-	 * 'reset_in_progress' flag available, so just use that.
+	 * I915_RESET_BACKOFF flag and the gt->reset.mutex can be tested for is_locked.
+	 * So just use those. Note that testing both is required due to the hideously
+	 * complex nature of the i915 driver's reset code paths.
 	 *
 	 * And note that in the case of a reset occurring during driver unload
-	 * (wedge_on_fini), skipping the cancel in _prepare (when the reset flag is set
-	 * is fine because there is another cancel in _finish (when the reset flag is
-	 * not).
+	 * (wedged_on_fini), skipping the cancel in reset_prepare/reset_fini (when the
+	 * reset flag/mutex are set) is fine because there is another explicit cancel in
+	 * intel_guc_submission_fini (when the reset flag/mutex are not).
 	 */
-	if (guc_to_gt(guc)->uc.reset_in_progress)
+	if (mutex_is_locked(&guc_to_gt(guc)->reset.mutex) ||
+	    test_bit(I915_RESET_BACKOFF, &guc_to_gt(guc)->reset.flags))
 		cancel_delayed_work(&guc->timestamp.work);
 	else
 		cancel_delayed_work_sync(&guc->timestamp.work);
@@ -1424,8 +1427,6 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc)
 	unsigned long flags;
 	ktime_t unused;
 
-	guc_cancel_busyness_worker(guc);
-
 	spin_lock_irqsave(&guc->timestamp.lock, flags);
 
 	guc_update_pm_timestamp(guc, &unused);
@@ -2004,13 +2005,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
 
 void intel_guc_submission_reset_finish(struct intel_guc *guc)
 {
-	/*
-	 * Ensure the busyness worker gets cancelled even on a fatal wedge.
-	 * Note that reset_prepare is not allowed to because it confuses lockdep.
-	 */
-	if (guc_submission_initialized(guc))
-		guc_cancel_busyness_worker(guc);
-
/* Reset called during driver load or during wedge? */
if (unlikely(!guc_submission_initialized(guc) ||