Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-18 Thread John Harrison

On 1/18/2023 09:54, Andy Shevchenko wrote:

On Wed, Jan 18, 2023 at 09:34:47AM -0800, John Harrison wrote:

On 1/18/2023 00:29, Andy Shevchenko wrote:

On Tue, Jan 17, 2023 at 01:36:26PM -0800, john.c.harri...@intel.com wrote:

From: John Harrison 

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count. So no change to that
code itself but the context version does change.

The only other caller is the code for dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So that
needs updating as well, as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up too.
Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
with GuC")

Must be one line.

In my tree it is one line. git itself does the line wrap when creating the
email.

Can you elaborate? I have never had such an issue with git send-email (starting
from v1.6.x of Git, for sure).
Hmm. Confused. I think it must have been something accidental in a text 
editor when reviewing the patch. Re-creating the emails now isn't 
wrapping it.



I missed that I need to manually unwrap it again before actually
sending the email. Although the CI checkpatch also pointed this out in its
own obscure manner.

...


Cc: Matthew Brost 
Cc: John Harrison 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Daniele Ceraolo Spurio 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Matthew Auld 
Cc: Matt Roper 
Cc: Umesh Nerlige Ramappa 
Cc: Michael Cheng 
Cc: Lucas De Marchi 
Cc: Tejas Upadhyay 
Cc: Andy Shevchenko 
Cc: Aravind Iddamsetty 
Cc: Alan Previn 
Cc: Bruce Chang 
Cc: intel-gfx@lists.freedesktop.org

Is it possible to utilize the --to and --cc parameters to git send-email instead of a
noisy Cc list?

This is the list auto-generated by the 'dim fixes' tool. I am told this is
the officially correct way to create a fixes patch - copy the output from
'dim fixes' as is into the patch headers.

Okay, so it may be a question for the `dim` tool then...

...


Stray change.

Intentional change to improve the readability of a function that is being
modified by other changes in this patch.

But not described in the commit message. That's why "stray".

Didn't seem worth mentioning. I can add a comment about it.

John.







Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-18 Thread John Harrison

On 1/18/2023 08:22, Tvrtko Ursulin wrote:

On 17/01/2023 21:36, john.c.harri...@intel.com wrote:

From: John Harrison 

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count. So no change to that
code itself but the context version does change.

The only other caller is the code for dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So that
needs updating as well, as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up too.

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full 
GPU reset

with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context 
reset")

Cc: Matthew Brost 
Cc: John Harrison 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Daniele Ceraolo Spurio 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Matthew Auld 
Cc: Matt Roper 
Cc: Umesh Nerlige Ramappa 
Cc: Michael Cheng 
Cc: Lucas De Marchi 
Cc: Tejas Upadhyay 
Cc: Andy Shevchenko 
Cc: Aravind Iddamsetty 
Cc: Alan Previn 
Cc: Bruce Chang 
Cc: intel-gfx@lists.freedesktop.org
Signed-off-by: John Harrison 
---
  drivers/gpu/drm/i915/gt/intel_context.c   |  1 +
  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  7 ++-
  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++
  drivers/gpu/drm/i915/i915_gpu_error.c |  5 ++---
  4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c

index e94365b08f1ef..df64cf1954c1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -552,6 +552,7 @@ struct i915_request 
*intel_context_find_active_request(struct intel_context *ce)

    active = rq;
  }
+    active = i915_request_get_rcu(active);
  spin_unlock_irqrestore(&parent->guc_state.lock, flags);
    return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c

index 922f1bb22dc68..517d1fb7ae333 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2236,10 +2236,13 @@ static void 
engine_dump_active_requests(struct intel_engine_cs *engine, struct d

  guc = intel_uc_uses_guc_submission(&engine->gt->uc);
  if (guc) {
  ce = intel_engine_get_hung_context(engine);
-    if (ce)
+    if (ce) {
+    /* This will reference count the request (if found) */
  hung_rq = intel_context_find_active_request(ce);
+    }
  } else {
  hung_rq = intel_engine_execlist_find_hung_request(engine);
+    hung_rq = i915_request_get_rcu(hung_rq);


Looks like intel_engine_execlist_find_hung_request can return NULL 
which i915_request_get_rcu will not handle.

Doh! That is correct.
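
Presumably the simplest fix is to only take the reference when the execlist
search actually found a request. Something along these lines (untested
sketch; the GuC branch keeps taking the reference internally as per this
patch):

	} else {
		hung_rq = intel_engine_execlist_find_hung_request(engine);
		/* Only take the reference if a hung request was found */
		if (hung_rq)
			hung_rq = i915_request_get_rcu(hung_rq);
	}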



Maybe it would come up simpler if intel_context_find_active_request 
wouldn't be getting the reference and then you can get one here at a 
single place for both branches?
That would require moving the spinlock outside of 
intel_context_find_active_request so that it can be held while acquiring 
the request reference. And that means bleeding internal knowledge of 
which spinlock to use out of the implementation and into the caller. As 
noted, the ideal would be extending the execlist implementation to do 
early tagging of the hung context/request at the point of hang 
detection. As opposed to rescanning the entire request list again at 
this point. And that will mean the lock being used inside 
'context_find_active' would be dependent upon GuC vs execlist backend. 
Which is an implementation detail we really should not be leaking out to 
the caller.


IMHO, it would be better to refactor engine_dump_active_requests() to 
acquire the sched_engine spinlock internally and only around the code 
which actually needs it (some of which is maybe execlist specific and 
not valid with GuC submission?). Certainly grabbing two independent 
spinlocks in a nested manner is not a good idea when there is no reason 
to do so.
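
Something vaguely like the below, where engine_dump_find_hung_request() is
a hypothetical helper wrapping the GuC vs execlist branches above (rough
sketch only, not compile tested):

	static void engine_dump_active_requests(struct intel_engine_cs *engine,
						struct drm_printer *m)
	{
		struct i915_request *hung_rq;
		unsigned long flags;

		/* Takes a reference on the hung request, if one is found */
		hung_rq = engine_dump_find_hung_request(engine);

		/* Only the request list walk needs the execlist lock */
		spin_lock_irqsave(&engine->sched_engine->lock, flags);
		intel_engine_dump_active_requests(&engine->sched_engine->requests,
						  hung_rq, m);
		spin_unlock_irqrestore(&engine->sched_engine->lock, flags);

		if (hung_rq)
			i915_request_put(hung_rq);
	}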


John.




  }
    if (hung_rq)
@@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct 
intel_engine_cs *engine, struct d

  else
intel_engine_dump_active_requests(&engine->sched_engine->requests,
    hung_rq, m);
+    if (hung_rq)
+    i915_request_put(hung_rq);
  }
    void

Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-18 Thread Andy Shevchenko
On Wed, Jan 18, 2023 at 09:34:47AM -0800, John Harrison wrote:
> On 1/18/2023 00:29, Andy Shevchenko wrote:
> > On Tue, Jan 17, 2023 at 01:36:26PM -0800, john.c.harri...@intel.com wrote:
> > > From: John Harrison 
> > > 
> > > When GuC support was added to error capture, the locking around the
> > > request object was broken. Fix it up.
> > > 
> > > The context based search manages the spinlocking around the search
> > > internally. So it needs to grab the reference count internally as
> > > well. The execlist only request based search relies on external
> > > locking, so it needs an external reference count. So no change to that
> > > code itself but the context version does change.
> > > 
> > > The only other caller is the code for dumping engine state to debugfs.
> > > That code wasn't previously getting an explicit reference at all as it
> > > does everything while holding the execlist specific spinlock. So that
> > > needs updating as well, as that spinlock doesn't help when using GuC
> > > submission. Rather than trying to conditionally get/put depending on
> > > submission model, just change it to always do the get/put.
> > > 
> > > In addition, intel_guc_find_hung_context() was not acquiring the
> > > correct spinlock before searching the request list. So fix that up too.
> > > Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU 
> > > reset
> > > with GuC")
> > Must be one line.
> In my tree it is one line. git itself does the line wrap when creating the
> email.

Can you elaborate? I have never had such an issue with git send-email (starting
from v1.6.x of Git, for sure).

> I missed that I need to manually unwrap it again before actually
> sending the email. Although the CI checkpatch also pointed this out in its
> own obscure manner.

...

> > > Cc: Matthew Brost 
> > > Cc: John Harrison 
> > > Cc: Jani Nikula 
> > > Cc: Joonas Lahtinen 
> > > Cc: Rodrigo Vivi 
> > > Cc: Tvrtko Ursulin 
> > > Cc: Daniele Ceraolo Spurio 
> > > Cc: Andrzej Hajda 
> > > Cc: Chris Wilson 
> > > Cc: Matthew Auld 
> > > Cc: Matt Roper 
> > > Cc: Umesh Nerlige Ramappa 
> > > Cc: Michael Cheng 
> > > Cc: Lucas De Marchi 
> > > Cc: Tejas Upadhyay 
> > > Cc: Andy Shevchenko 
> > > Cc: Aravind Iddamsetty 
> > > Cc: Alan Previn 
> > > Cc: Bruce Chang 
> > > Cc: intel-gfx@lists.freedesktop.org
> > Is it possible to utilize the --to and --cc parameters to git send-email instead of a
> > noisy Cc list?
> This is the list auto-generated by the 'dim fixes' tool. I am told this is
> the officially correct way to create a fixes patch - copy the output from
> 'dim fixes' as is into the patch headers.

Okay, so it may be a question for the `dim` tool then...

...

> > Stray change.
> Intentional change to improve the readability of a function that is being
> modified by other changes in this patch.

But not described in the commit message. That's why "stray".

-- 
With Best Regards,
Andy Shevchenko




Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-18 Thread John Harrison

On 1/18/2023 00:29, Andy Shevchenko wrote:

On Tue, Jan 17, 2023 at 01:36:26PM -0800, john.c.harri...@intel.com wrote:

From: John Harrison 

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count. So no change to that
code itself but the context version does change.

The only other caller is the code for dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So that
needs updating as well, as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up too.
Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
with GuC")

Must be one line.
In my tree it is one line. git itself does the line wrap when creating 
the email. I missed that I need to manually unwrap it again before 
actually sending the email. Although the CI checkpatch also pointed this 
out in its own obscure manner.





Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost 
Cc: John Harrison 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Daniele Ceraolo Spurio 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Matthew Auld 
Cc: Matt Roper 
Cc: Umesh Nerlige Ramappa 
Cc: Michael Cheng 
Cc: Lucas De Marchi 
Cc: Tejas Upadhyay 
Cc: Andy Shevchenko 
Cc: Aravind Iddamsetty 
Cc: Alan Previn 
Cc: Bruce Chang 
Cc: intel-gfx@lists.freedesktop.org

Is it possible to utilize the --to and --cc parameters to git send-email instead of a
noisy Cc list?
This is the list auto-generated by the 'dim fixes' tool. I am told this 
is the officially correct way to create a fixes patch - copy the output 
from 'dim fixes' as is into the patch headers.



...


+   if (hung_rq)
+   i915_request_put(hung_rq);

In the Linux kernel the idiom is that resource-freeing APIs should be NULL-aware
(or ERR_PTR-aware, or both). Does i915 follow that? If so, the test should be
inside i915_request_put() rather than in any of the callers.
That's as may be. But this is how i915_request_put() currently works - it 
is simply a wrapper around 'dma_fence_put(&rq->fence);'. So passing in a 
null pointer will immediately cause a null pointer deref. If you want 
the put implementation to change and to re-work all its callers, that 
should be done in a separate patch and not piled on top of other changes.
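
For reference, the NULL-aware variant being suggested would presumably look
something like this (illustration only - not how the current code reads):

	static inline void i915_request_put(struct i915_request *rq)
	{
		/* Hypothetical NULL-aware version */
		if (rq)
			dma_fence_put(&rq->fence);
	}

But again, that would be a separate cleanup from this locking fix.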




...


@@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs 
*engine)
xa_lock(&guc->context_lookup);
goto done;
}
+
  next:
intel_context_put(ce);
xa_lock(&guc->context_lookup);

Stray change.
Intentional change to improve the readability of a function that is 
being modified by other changes in this patch.


John.







Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-18 Thread Tvrtko Ursulin



On 17/01/2023 21:36, john.c.harri...@intel.com wrote:

From: John Harrison 

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count. So no change to that
code itself but the context version does change.

The only other caller is the code for dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So that
needs updating as well, as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up too.

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost 
Cc: John Harrison 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Daniele Ceraolo Spurio 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Matthew Auld 
Cc: Matt Roper 
Cc: Umesh Nerlige Ramappa 
Cc: Michael Cheng 
Cc: Lucas De Marchi 
Cc: Tejas Upadhyay 
Cc: Andy Shevchenko 
Cc: Aravind Iddamsetty 
Cc: Alan Previn 
Cc: Bruce Chang 
Cc: intel-gfx@lists.freedesktop.org
Signed-off-by: John Harrison 
---
  drivers/gpu/drm/i915/gt/intel_context.c   |  1 +
  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  7 ++-
  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++
  drivers/gpu/drm/i915/i915_gpu_error.c |  5 ++---
  4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..df64cf1954c1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -552,6 +552,7 @@ struct i915_request 
*intel_context_find_active_request(struct intel_context *ce)
  
  		active = rq;

}
+   active = i915_request_get_rcu(active);
spin_unlock_irqrestore(&parent->guc_state.lock, flags);
  
  	return active;

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..517d1fb7ae333 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2236,10 +2236,13 @@ static void engine_dump_active_requests(struct 
intel_engine_cs *engine, struct d
guc = intel_uc_uses_guc_submission(&engine->gt->uc);
if (guc) {
ce = intel_engine_get_hung_context(engine);
-   if (ce)
+   if (ce) {
+   /* This will reference count the request (if found) */
hung_rq = intel_context_find_active_request(ce);
+   }
} else {
hung_rq = intel_engine_execlist_find_hung_request(engine);
+   hung_rq = i915_request_get_rcu(hung_rq);


Looks like intel_engine_execlist_find_hung_request can return NULL which 
i915_request_get_rcu will not handle.


Maybe it would come up simpler if intel_context_find_active_request 
wouldn't be getting the reference and then you can get one here at a 
single place for both branches?



}
  
  	if (hung_rq)

@@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct 
intel_engine_cs *engine, struct d
else

intel_engine_dump_active_requests(&engine->sched_engine->requests,
  hung_rq, m);
+   if (hung_rq)
+   i915_request_put(hung_rq);
  }
  
  void intel_engine_dump(struct intel_engine_cs *engine,

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..3b34a82d692be 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs 
*engine)
  
  	xa_lock_irqsave(&guc->context_lookup, flags);

xa_for_each(&guc->context_lookup, index, ce) {
+   bool found;
+
if (!kref_get_unless_zero(&ce->ref))
continue;
  
@@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)

goto next;
}
  
+		found = false;

+   spin_lock(&ce->guc_state.lock);
list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
if (i915_test_request_state(rq) != I915_REQUEST_ACTIV

Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-18 Thread Andy Shevchenko
On Tue, Jan 17, 2023 at 01:36:26PM -0800, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
> 
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count. So no change to that
> code itself but the context version does change.
> 
> The only other caller is the code for dumping engine state to debugfs.
> That code wasn't previously getting an explicit reference at all as it
> does everything while holding the execlist specific spinlock. So that
> needs updating as well, as that spinlock doesn't help when using GuC
> submission. Rather than trying to conditionally get/put depending on
> submission model, just change it to always do the get/put.
> 
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up too.

> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
> with GuC")

Must be one line.

> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")

> Cc: Matthew Brost 
> Cc: John Harrison 
> Cc: Jani Nikula 
> Cc: Joonas Lahtinen 
> Cc: Rodrigo Vivi 
> Cc: Tvrtko Ursulin 
> Cc: Daniele Ceraolo Spurio 
> Cc: Andrzej Hajda 
> Cc: Chris Wilson 
> Cc: Matthew Auld 
> Cc: Matt Roper 
> Cc: Umesh Nerlige Ramappa 
> Cc: Michael Cheng 
> Cc: Lucas De Marchi 
> Cc: Tejas Upadhyay 
> Cc: Andy Shevchenko 
> Cc: Aravind Iddamsetty 
> Cc: Alan Previn 
> Cc: Bruce Chang 
> Cc: intel-gfx@lists.freedesktop.org

Is it possible to utilize the --to and --cc parameters to git send-email instead of a
noisy Cc list?

...

> + if (hung_rq)
> + i915_request_put(hung_rq);

In the Linux kernel the idiom is that resource-freeing APIs should be NULL-aware
(or ERR_PTR-aware, or both). Does i915 follow that? If so, the test should be
inside i915_request_put() rather than in any of the callers.

...

> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs 
> *engine)
>   xa_lock(&guc->context_lookup);
>   goto done;
>   }
> +
>  next:
>   intel_context_put(ce);
>   xa_lock(&guc->context_lookup);

Stray change.

-- 
With Best Regards,
Andy Shevchenko




[Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-17 Thread John . C . Harrison
From: John Harrison 

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count. So no change to that
code itself but the context version does change.

The only other caller is the code for dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So that
needs updating as well, as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up too.

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost 
Cc: John Harrison 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Daniele Ceraolo Spurio 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Matthew Auld 
Cc: Matt Roper 
Cc: Umesh Nerlige Ramappa 
Cc: Michael Cheng 
Cc: Lucas De Marchi 
Cc: Tejas Upadhyay 
Cc: Andy Shevchenko 
Cc: Aravind Iddamsetty 
Cc: Alan Previn 
Cc: Bruce Chang 
Cc: intel-gfx@lists.freedesktop.org
Signed-off-by: John Harrison 
---
 drivers/gpu/drm/i915/gt/intel_context.c   |  1 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  7 ++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++
 drivers/gpu/drm/i915/i915_gpu_error.c |  5 ++---
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..df64cf1954c1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -552,6 +552,7 @@ struct i915_request 
*intel_context_find_active_request(struct intel_context *ce)
 
active = rq;
}
+   active = i915_request_get_rcu(active);
spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..517d1fb7ae333 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2236,10 +2236,13 @@ static void engine_dump_active_requests(struct 
intel_engine_cs *engine, struct d
guc = intel_uc_uses_guc_submission(&engine->gt->uc);
if (guc) {
ce = intel_engine_get_hung_context(engine);
-   if (ce)
+   if (ce) {
+   /* This will reference count the request (if found) */
hung_rq = intel_context_find_active_request(ce);
+   }
} else {
hung_rq = intel_engine_execlist_find_hung_request(engine);
+   hung_rq = i915_request_get_rcu(hung_rq);
}
 
if (hung_rq)
@@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct 
intel_engine_cs *engine, struct d
else

intel_engine_dump_active_requests(&engine->sched_engine->requests,
  hung_rq, m);
+   if (hung_rq)
+   i915_request_put(hung_rq);
 }
 
 void intel_engine_dump(struct intel_engine_cs *engine,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..3b34a82d692be 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs 
*engine)
 
xa_lock_irqsave(&guc->context_lookup, flags);
xa_for_each(&guc->context_lookup, index, ce) {
+   bool found;
+
if (!kref_get_unless_zero(&ce->ref))
continue;
 
@@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs 
*engine)
goto next;
}
 
+   found = false;
+   spin_lock(&ce->guc_state.lock);
list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
continue;
 
+   found = true;
+   break;
+   }
+   spin_unlock(&ce->guc_state.lock);
+
+   if (found) {
intel_engine_set_hung_context(engine, ce);
 
/* Can only cope wit