On 5/28/2025 4:05 PM, Jesus Narvaez wrote:
There is a rare race condition when preparing for a reset where
guc_lrc_desc_unpin() could be in the process of deregistering a context
while a different thread is scrubbing outstanding contexts and it alters
the context state and does a wakeref put. Then, if there is a failure
with deregister_context(), a second wakeref put could occur. As a result
the wakeref count could drop below 0 and fail an INTEL_WAKEREF_BUG_ON()
check.

Therefore if there is a failure with deregister_context(), undo the
context state changes and do a wakeref put only if the context was set
to be destroyed earlier.
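
For illustration only, the sequence above can be modelled with a minimal,
self-contained C sketch. This is a toy model, not driver code: struct
toy_ctx, wakeref_count and all toy_* functions are hypothetical stand-ins
for the context state flags, the GT wakeref and the two racing paths, and
locking is left out because the two paths are simply run one after the
other.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the context state flags. */
struct toy_ctx {
	bool destroyed;
	bool registered;
};

/* Hypothetical stand-in for the GT wakeref count. */
static int wakeref_count;

static void toy_wakeref_put(void)
{
	wakeref_count--;
	/* Plays the role of the INTEL_WAKEREF_BUG_ON() check. */
	assert(wakeref_count >= 0);
}

/* Reset path: scrubs the context state and drops the wakeref. */
static void toy_reset_scrub(struct toy_ctx *ce)
{
	if (ce->destroyed) {
		ce->destroyed = false;
		toy_wakeref_put();
	}
}

/* Failure path after a failed deregister, guarded as in the patch. */
static void toy_deregister_failed(struct toy_ctx *ce)
{
	bool pending_destroyed = ce->destroyed;

	if (pending_destroyed) {
		ce->registered = true;
		ce->destroyed = false;
		toy_wakeref_put();
	}
}

/* Replays the race: reset scrubs first, then the deregister fails. */
static int toy_race_final_count(void)
{
	struct toy_ctx ce = { .destroyed = true, .registered = false };

	wakeref_count = 1; /* the single reference taken for the destroy */
	toy_reset_scrub(&ce);
	toy_deregister_failed(&ce);
	return wakeref_count;
}
```

With the pending_destroyed check in place, toy_race_final_count() leaves
the count at 0. Dropping the check from toy_deregister_failed() would
perform a second put and trip the assert, which is the failure described
above.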

v2: Expand comment to better explain change. (Daniele)
v3: Removed addition to the original comment. (Daniele)

Fixes: 2f2cc53b5fe7 ("drm/i915/guc: Close deregister-context race against CT-loss")
Signed-off-by: Jesus Narvaez <jesus.narv...@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospu...@intel.com>
Cc: Alan Previn <alan.previn.teres.ale...@intel.com>
Cc: Anshuman Gupta <anshuman.gu...@intel.com>
Cc: Mousumi Jana <mousumi.j...@intel.com>
Cc: Rodrigo Vivi <rodrigo.v...@intel.com>
Cc: Matt Roper <matthew.d.ro...@intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospu...@intel.com>

Daniele

---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 ++++++++++++++---
  1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 108331a69995..127316d2c8aa 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3443,18 +3443,29 @@ static inline int guc_lrc_desc_unpin(struct intel_context *ce)
         * GuC is active, lets destroy this context, but at this point we can still be racing
         * with suspend, so we undo everything if the H2G fails in deregister_context so
         * that GuC reset will find this context during clean up.
+        *
+        * There is a race condition where the reset code could have altered
+        * this context's state and done a wakeref put before we try to
+        * deregister it here. So check if the context is still set to be
+        * destroyed before undoing earlier changes, to avoid two wakeref puts
+        * on the same context.
         */
        ret = deregister_context(ce, ce->guc_id.id);
        if (ret) {
+               bool pending_destroyed;
                spin_lock_irqsave(&ce->guc_state.lock, flags);
-               set_context_registered(ce);
-               clr_context_destroyed(ce);
+               pending_destroyed = context_destroyed(ce);
+               if (pending_destroyed) {
+                       set_context_registered(ce);
+                       clr_context_destroyed(ce);
+               }
                spin_unlock_irqrestore(&ce->guc_state.lock, flags);
                /*
                 * As gt-pm is awake at function entry, intel_wakeref_put_async merely decrements
                 * the wakeref immediately but per function spec usage call this after unlock.
                 */
-               intel_wakeref_put_async(&gt->wakeref);
+               if (pending_destroyed)
+                       intel_wakeref_put_async(&gt->wakeref);
        }
        return ret;
