Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-09-06 Thread John Harrison

On 9/6/2023 02:17, Andi Shyti wrote:

Hi John,


 static void guc_cancel_busyness_worker(struct intel_guc *guc)
 {
-   cancel_delayed_work_sync(&guc->timestamp.work);
+   /*
+    * When intel_gt_reset was called, task will hold a lock.
+    * To cancel delayed work here, the _sync version will also acquire a lock, which might
+    * trigger the possible circular locking dependency warning.
+    * Check the reset_in_progress flag, call async version if reset is in progress.
+    */

This needs to explain in much more detail what is going on and why it is not
a problem. E.g.:

  The busyness worker needs to be cancelled. In general that means
  using the synchronous cancel version to ensure that an in-progress
  worker will not keep executing beyond whatever is happening that
  needs the cancel. E.g. suspend, driver unload, etc. However, in the
  case of a reset, the synchronous version is not required and can
  trigger a false deadlock detection warning.

  The busyness worker takes the reset mutex to protect against resets
  interfering with it. However, it does a trylock and bails out if the
  reset lock is already acquired. Thus there is no actual deadlock or
  other concern with the worker running concurrently with a reset. So
  an asynchronous cancel is safe in the case of a reset rather than a
  driver unload or suspend type operation. On the other hand, if the
  cancel_sync version is used when a reset is in progress then the
  mutex deadlock detection sees the mutex being acquired through
  multiple paths and complains.

  So just don't bother. That keeps the detection code happy and is
  safe because of the trylock code described above.
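
As a rough userspace illustration of the trylock behaviour described above (not the i915 code), the sketch below models the reset mutex with a C11 atomic flag. The names `reset_lock`, `reset_trylock`, `reset_unlock`, `busyness_worker_once` and `samples_taken` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reset mutex; true means "held". */
static atomic_bool reset_lock;
static int samples_taken;

/* Returns true if the lock was free and is now held by the caller. */
static bool reset_trylock(void)
{
	return !atomic_exchange_explicit(&reset_lock, true, memory_order_acquire);
}

static void reset_unlock(void)
{
	assert(atomic_load(&reset_lock));	/* unlocking a free lock is a bug */
	atomic_store_explicit(&reset_lock, false, memory_order_release);
}

/*
 * One pass of a busyness-style worker. It never blocks on the reset
 * lock: if a reset holds it, the pass is skipped and false is returned
 * so the caller can try again later.
 */
static bool busyness_worker_once(void)
{
	if (!reset_trylock())
		return false;

	samples_taken++;	/* placeholder for the real sampling work */
	reset_unlock();
	return true;
}
```

Because the worker only ever tries the lock, a reset that already holds it cannot be blocked by the worker. That is the property the argument for the asynchronous cancel relies on.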

So why do we even need to cancel anything if it doesn't do anything while
the reset is in progress?

It still needs to be cancelled. The worker only aborts if it is actively
executing concurrently with the reset. It might not start to execute until
after the reset has completed. And there is presumably a reason why the
cancel is being called, a reason not necessarily related to resets at all.
Leaving the worker to run arbitrarily after the driver is expecting it to be
stopped will lead to much worse things than a fake lockdep splat, e.g. a use
after free pointer deref.

I was actually thinking why not leave things as they are and just
disable lockdep from CI. This doesn't look like a relevant report
to me.

Andi

Disable lockdep? The whole of lockdep? We absolutely do not want to disable
an extremely important deadlock testing infrastructure in our test
framework. That would be defeating the whole point of CI.

Potentially we could annotate this one particular scenario to suppress this
one particular error.  But it seems simpler and safer to just update the
code to not hit that scenario in the first place.

> yes... lockdep is a debug tool and might provide false reports...
> We need to have a great willingness to start fixing and hunting
> debug lockdep's false positives (like this one, for instance).

That is how lockdep works. It's like a compiler warning. You have to fix
them even if you think they don't matter. Because otherwise, when
someone tries to turn warnings on, they drown in a sea of other people's
unrelated garbage that they did not bother to fix. If lockdep is to be
of any use at all then it must be run regularly as part of a CI type
system and any issues it finds must be fixed up by the developers that
own the relevant code. Where fixing means either fixing genuine bugs,
re-working the code to not hit a false positive or annotating the code
to explain to lockdep why it is a safe operation.

> It's even more annoying to reduce our CI pass rates, especially
> when in BAT tests, with such false deadlocks.

Maybe. But it is even more annoying when you have a genuine locking
issue that you don't notice because you have disabled lockdep and just
have some random hang issue that is impossible to reproduce or debug.

> It's the developer's responsibility to test its code with
> debug_lockdep and fix all the potential deadlocks and ignore the
> false ones.

You seem to have this backwards. Developers are not expected to run
every possible test on every possible platform in every possible
configuration. That is the job of CI.

John.

> I sent a patch for this[*] already.
>
> Andi
>
> [*] https://gitlab.freedesktop.org/gfx-ci/i915-infra/-/merge_requests/128




Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-09-06 Thread John Harrison

On 9/5/2023 23:50, Daniel Vetter wrote:

On Mon, Aug 28, 2023 at 04:01:38PM -0700, John Harrison wrote:

On 8/23/2023 10:37, John Harrison wrote:

On 8/23/2023 09:00, Daniel Vetter wrote:

On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:

On 8/11/2023 11:20, Zhanjun Dong wrote:

This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
When intel_gt_reset was called, task will hold a lock.
To cancel delayed work here, the _sync version will also acquire a lock,
which might trigger the possible circular locking dependency warning.
When intel_gt_reset called, reset_in_progress flag will be set, add code
to check the flag, call async version if reset is in progress.
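
The body of the change is not quoted in this thread, only its comment. As a rough model of the decision the commit message describes (an assumption about its shape, not the actual i915 code), the sketch below picks a cancel mode from a reset flag. `enum cancel_mode`, `struct gt_model` and `pick_cancel_mode` are hypothetical names; in the kernel the two outcomes would correspond to `cancel_delayed_work()` and `cancel_delayed_work_sync()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How the cancel is issued; the names are illustrative only. */
enum cancel_mode {
	CANCEL_ASYNC,	/* dequeue if pending, do not wait for a running worker */
	CANCEL_SYNC,	/* dequeue and wait for a running worker to finish */
};

/* Hypothetical stand-in for the state the patch inspects. */
struct gt_model {
	bool reset_in_progress;
};

/* During a reset use the non-waiting cancel, otherwise the waiting one. */
static enum cancel_mode pick_cancel_mode(const struct gt_model *gt)
{
	assert(gt != NULL);
	return gt->reset_in_progress ? CANCEL_ASYNC : CANCEL_SYNC;
}
```

Note that the asynchronous branch gives no guarantee that the worker has stopped by the time the caller continues, which is the point the later replies in this thread argue about.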

Signed-off-by: Zhanjun Dong
Cc: John Harrison
Cc: Andi Shyti
Cc: Daniel Vetter
---
    drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
    1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..600388c849f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
    static void guc_cancel_busyness_worker(struct intel_guc *guc)
    {
-    cancel_delayed_work_sync(&guc->timestamp.work);
+    /*
+     * When intel_gt_reset was called, task will hold a lock.
+     * To cancel delayed work here, the _sync version will also acquire a lock, which might
+     * trigger the possible circular locking dependency warning.
+     * Check the reset_in_progress flag, call async version if reset is in progress.
+     */

This needs to explain in much more detail what is going on and why it is not
a problem. E.g.:

     The busyness worker needs to be cancelled. In general that means
     using the synchronous cancel version to ensure that an in-progress
     worker will not keep executing beyond whatever is happening that
     needs the cancel. E.g. suspend, driver unload, etc. However, in the
     case of a reset, the synchronous version is not required and can
     trigger a false deadlock detection warning.

     The busyness worker takes the reset mutex to protect against resets
     interfering with it. However, it does a trylock and bails out if the
     reset lock is already acquired. Thus there is no actual deadlock or
     other concern with the worker running concurrently with a reset. So
     an asynchronous cancel is safe in the case of a reset rather than a
     driver unload or suspend type operation. On the other hand, if the
     cancel_sync version is used when a reset is in progress then the
     mutex deadlock detection sees the mutex being acquired through
     multiple paths and complains.

     So just don't bother. That keeps the detection code happy and is
     safe because of the trylock code described above.

So why do we even need to cancel anything if it doesn't do anything while
the reset is in progress?

It still needs to be cancelled. The worker only aborts if it is actively
executing concurrently with the reset. It might not start to execute
until after the reset has completed. And there is presumably a reason
why the cancel is being called, a reason not necessarily related to
resets at all. Leaving the worker to run arbitrarily after the driver is
expecting it to be stopped will lead to much worse things than a fake
lockdep splat, e.g. a use after free pointer deref.

John.

> > @Daniel Vetter - ping? Is this explanation sufficient? Are you okay with
> > this change now?
>
> Sorry for the late reply, I'm constantly behind on mails :-/ Ping me on
> irc next time around if I don't reply, that's quicker.
>
> "presumably" isn't good enough for locking design. Either you know, and
> can prove it all, or you shouldn't touch the code and its locking design
> before you've figured this out.
>
> Again, either this is a deadlock, race condition, or the cancel isn't
> necessary. And this argument works in full generality. All this patch does
> is replace the deadlock with one of the other two, and that's not good
> enough if you don't even know which one it is.
>
> - if you need the cancel, you have a race condition
>
> - if you don't have a race condition, you don't need the cancel

In the case of a reset in progress, we do not strictly need the cancel.
The worker thread will take care of avoiding a deadlock by itself. But
it is more efficient to do the cancel and avoid unnecessary code
execution if possible. It is also more logically correct - the worker is
being stopped, therefore we should cancel any pending execution of the
worker.

In the case of a reset not being in progress, we absolutely do need the
cancel as there are multiple race conditions.

> - currently you have the deadlock

No, we do not. There is no deadlock.

The worker thread explicitly does a trylock and reschedules itself for
later if it could not get the lock.

Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-09-06 Thread Daniel Vetter
On Wed, Sep 06, 2023 at 01:04:06PM +0300, Jani Nikula wrote:
> On Wed, 06 Sep 2023, Andi Shyti  wrote:
> > It's the developer's responsibility to test its code with
> > debug_lockdep and fix all the potential deadlocks and ignore the
> > false ones.
> 
> No. Manual validation of lockdep reports is not feasible. Lockdep is the
> tool to validate locking. It's the developer's responsibility to make
> lockdep understand the design.

Yeah I guess I need to drop my locking design principle once more:

If lockdep doesn't understand your locking design, your design is shit.

You need to fix the design, not play whack-a-mole with lockdep. Or worse,
pretend there's no problem and just disable lockdep outright.

If you don't understand your design, and can't succinctly explain it (or
demonstrate the full hierarchy with lockdep priming, testing in CI isn't
good enough for anything remotely complex), then you have a _really_ big
problem. Yes CI is good at catching accidental changes in locking design,
but if you use it for anything more than that you're in deep trouble.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-09-06 Thread Jani Nikula
On Wed, 06 Sep 2023, Andi Shyti  wrote:
>> > I was actually thinking why not leave things as they are and just
>> > disable lockdep from CI. This doesn't look like a relevant report
>> > to me.
>> > 
>> > Andi
>> Disable lockdep? The whole of lockdep? We absolutely do not want to disable
>> an extremely important deadlock testing infrastructure in our test
>> framework. That would be defeating the whole point of CI.
>> 
>> Potentially we could annotate this one particular scenario to suppress this
>> one particular error.  But it seems simpler and safer to just update the
>> code to not hit that scenario in the first place.
>
> yes... lockdep is a debug tool and might provide false reports...
> We need to have a great willingness to start fixing and hunting
> debug lockdep's false positives (like this one, for instance).
>
> It's even more annoying to reduce our CI pass rates, especially
> when in BAT tests, with such false deadlocks.

Make lockdep understand what you're doing, and there are no false
positives. That's all there is to it.

> It's the developer's responsibility to test its code with
> debug_lockdep and fix all the potential deadlocks and ignore the
> false ones.

No. Manual validation of lockdep reports is not feasible. Lockdep is the
tool to validate locking. It's the developer's responsibility to make
lockdep understand the design.
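
To illustrate why an ordering checker can complain about a "possible circular locking dependency" without any hang having occurred, here is a toy model. It is only a sketch of the general idea of recording lock-order edges and rejecting any edge that would close a cycle; it is not lockdep's implementation. The class names `RESET_MUTEX` and `WORK_FLUSH` and the helpers `reachable` and `record_dependency` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CLASSES 8

/* Hypothetical lock classes standing in for the two sides of the report. */
enum { RESET_MUTEX, WORK_FLUSH };

/* edge[a][b] means "b was acquired (or waited on) while a was held". */
static bool edge[MAX_CLASSES][MAX_CLASSES];

/* Depth-first search: can 'to' be reached from 'from' via recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_CLASSES; next++)
		if (edge[from][next] && !seen[next] && reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record the ordering "held -> wanted". Returns false, the toy
 * equivalent of a splat, if 'held' is already reachable from 'wanted',
 * because the new edge would then close a cycle.
 */
static bool record_dependency(int held, int wanted)
{
	bool seen[MAX_CLASSES] = { false };

	assert(held >= 0 && held < MAX_CLASSES);
	assert(wanted >= 0 && wanted < MAX_CLASSES);

	if (reachable(wanted, held, seen))
		return false;
	edge[held][wanted] = true;
	return true;
}
```

For example, once `record_dependency(RESET_MUTEX, WORK_FLUSH)` has been recorded for a reset path that flushes work while holding the mutex, a later `record_dependency(WORK_FLUSH, RESET_MUTEX)` for a worker that takes the mutex is rejected, even if the two paths never actually ran at the same time.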

Besides, locking is often subtle. Stuff can change as a side effect even
when you're not intentionally changing locking, e.g. during
refactoring. What you're suggesting effectively means all developers
should run all of igt on a bunch of different generations of machines
with lockdep enabled. Realistically, not going to happen, and we have CI
because of this.

> I sent a patch for this[*] already.
>
> Andi
>
> [*] https://gitlab.freedesktop.org/gfx-ci/i915-infra/-/merge_requests/128

Yeah, no.


BR,
Jani.


-- 
Jani Nikula, Intel Open Source Graphics Center


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-09-06 Thread Andi Shyti
Hi John,

> > > > > > static void guc_cancel_busyness_worker(struct intel_guc *guc)
> > > > > > {
> > > > > > -   cancel_delayed_work_sync(&guc->timestamp.work);
> > > > > > +   /*
> > > > > > +    * When intel_gt_reset was called, task will hold a lock.
> > > > > > +    * To cancel delayed work here, the _sync version will also acquire a lock, which might
> > > > > > +    * trigger the possible circular locking dependency warning.
> > > > > > +    * Check the reset_in_progress flag, call async version if reset is in progress.
> > > > > > +    */
> > > > > This needs to explain in much more detail what is going on and why it is not
> > > > > a problem. E.g.:
> > > > > 
> > > > >  The busyness worker needs to be cancelled. In general that means
> > > > >  using the synchronous cancel version to ensure that an in-progress
> > > > >  worker will not keep executing beyond whatever is happening that
> > > > >  needs the cancel. E.g. suspend, driver unload, etc. However, in the
> > > > >  case of a reset, the synchronous version is not required and can
> > > > >  trigger a false deadlock detection warning.
> > > > > 
> > > > >  The busyness worker takes the reset mutex to protect against resets
> > > > >  interfering with it. However, it does a trylock and bails out if the
> > > > >  reset lock is already acquired. Thus there is no actual deadlock or
> > > > >  other concern with the worker running concurrently with a reset. So
> > > > >  an asynchronous cancel is safe in the case of a reset rather than a
> > > > >  driver unload or suspend type operation. On the other hand, if the
> > > > >  cancel_sync version is used when a reset is in progress then the
> > > > >  mutex deadlock detection sees the mutex being acquired through
> > > > >  multiple paths and complains.
> > > > > 
> > > > >  So just don't bother. That keeps the detection code happy and is
> > > > >  safe because of the trylock code described above.
> > > > So why do we even need to cancel anything if it doesn't do anything while
> > > > the reset is in progress?
> > > It still needs to be cancelled. The worker only aborts if it is actively
> > > executing concurrently with the reset. It might not start to execute until
> > > after the reset has completed. And there is presumably a reason why the
> > > cancel is being called, a reason not necessarily related to resets at all.
> > > Leaving the worker to run arbitrarily after the driver is expecting it to be
> > > stopped will lead to much worse things than a fake lockdep splat, e.g. a use
> > > after free pointer deref.
> > I was actually thinking why not leave things as they are and just
> > disable lockdep from CI. This doesn't look like a relevant report
> > to me.
> > 
> > Andi
> Disable lockdep? The whole of lockdep? We absolutely do not want to disable
> an extremely important deadlock testing infrastructure in our test
> framework. That would be defeating the whole point of CI.
> 
> Potentially we could annotate this one particular scenario to suppress this
> one particular error.  But it seems simpler and safer to just update the
> code to not hit that scenario in the first place.

yes... lockdep is a debug tool and might provide false reports...
We need to have a great willingness to start fixing and hunting
debug lockdep's false positives (like this one, for instance).

It's even more annoying to reduce our CI pass rates, especially
when in BAT tests, with such false deadlocks.

It's the developer's responsibility to test its code with
debug_lockdep and fix all the potential deadlocks and ignore the
false ones.

I sent a patch for this[*] already.

Andi

[*] https://gitlab.freedesktop.org/gfx-ci/i915-infra/-/merge_requests/128


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-09-06 Thread Daniel Vetter
On Mon, Aug 28, 2023 at 04:01:38PM -0700, John Harrison wrote:
> On 8/23/2023 10:37, John Harrison wrote:
> > On 8/23/2023 09:00, Daniel Vetter wrote:
> > > On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:
> > > > On 8/11/2023 11:20, Zhanjun Dong wrote:
> > > > > This attempts to avoid circular locking dependency between flush delayed
> > > > > work and intel_gt_reset.
> > > > > When intel_gt_reset was called, task will hold a lock.
> > > > > To cancel delayed work here, the _sync version will also acquire a lock,
> > > > > which might trigger the possible circular locking dependency warning.
> > > > > When intel_gt_reset called, reset_in_progress flag will be set, add code
> > > > > to check the flag, call async version if reset is in progress.
> > > > > 
> > > > > Signed-off-by: Zhanjun Dong
> > > > > Cc: John Harrison
> > > > > Cc: Andi Shyti
> > > > > Cc: Daniel Vetter
> > > > > ---
> > > > >    drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
> > > > >    1 file changed, 10 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > index a0e3ef1c65d2..600388c849f7 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > > @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
> > > > >    static void guc_cancel_busyness_worker(struct intel_guc *guc)
> > > > >    {
> > > > > -    cancel_delayed_work_sync(&guc->timestamp.work);
> > > > > +    /*
> > > > > +     * When intel_gt_reset was called, task will hold a lock.
> > > > > +     * To cancel delayed work here, the _sync version will also acquire a lock, which might
> > > > > +     * trigger the possible circular locking dependency warning.
> > > > > +     * Check the reset_in_progress flag, call async version if reset is in progress.
> > > > > +     */
> > > > This needs to explain in much more detail what is going on and why it is not
> > > > a problem. E.g.:
> > > > 
> > > >     The busyness worker needs to be cancelled. In general that means
> > > >     using the synchronous cancel version to ensure that an in-progress
> > > >     worker will not keep executing beyond whatever is happening that
> > > >     needs the cancel. E.g. suspend, driver unload, etc. However, in the
> > > >     case of a reset, the synchronous version is not required and can
> > > >     trigger a false deadlock detection warning.
> > > > 
> > > >     The busyness worker takes the reset mutex to protect against resets
> > > >     interfering with it. However, it does a trylock and bails out if the
> > > >     reset lock is already acquired. Thus there is no actual deadlock or
> > > >     other concern with the worker running concurrently with a reset. So
> > > >     an asynchronous cancel is safe in the case of a reset rather than a
> > > >     driver unload or suspend type operation. On the other hand, if the
> > > >     cancel_sync version is used when a reset is in progress then the
> > > >     mutex deadlock detection sees the mutex being acquired through
> > > >     multiple paths and complains.
> > > > 
> > > >     So just don't bother. That keeps the detection code happy and is
> > > >     safe because of the trylock code described above.
> > > So why do we even need to cancel anything if it doesn't do anything while
> > > the reset is in progress?
> > It still needs to be cancelled. The worker only aborts if it is actively
> > executing concurrently with the reset. It might not start to execute
> > until after the reset has completed. And there is presumably a reason
> > why the cancel is being called, a reason not necessarily related to
> > resets at all. Leaving the worker to run arbitrarily after the driver is
> > expecting it to be stopped will lead to much worse things than a fake
> > lockdep splat, e.g. a use after free pointer deref.
> > 
> > John.
> @Daniel Vetter - ping? Is this explanation sufficient? Are you okay with
> this change now?

Sorry for the late reply, I'm constantly behind on mails :-/ Ping me on
irc next time around if I don't reply, that's quicker.

"presumably" isn't good enough for locking design. Either you know, and
can prove it all, or you shouldn't touch the code and its locking design
before you've figured this out.

Again, either this is a deadlock, race condition, or the cancel isn't
necessary. And this argument works in full generality. All this patch does
is replace the deadlock with one of the other two, and that's not good
enough if you don't even know which one it is.

- if you need the cancel, you have a race condition

- if you don't have a race condition, you don't need the cancel

- currently you have the deadlock

"presumably" and "maybe" ain't enough for locking design.


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-31 Thread John Harrison

On 8/31/2023 07:00, Andi Shyti wrote:

Hi,


diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..600388c849f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
static void guc_cancel_busyness_worker(struct intel_guc *guc)
{
-   cancel_delayed_work_sync(&guc->timestamp.work);
+   /*
+    * When intel_gt_reset was called, task will hold a lock.
+    * To cancel delayed work here, the _sync version will also acquire a lock, which might
+    * trigger the possible circular locking dependency warning.
+    * Check the reset_in_progress flag, call async version if reset is in progress.
+    */

This needs to explain in much more detail what is going on and why it is not
a problem. E.g.:

 The busyness worker needs to be cancelled. In general that means
 using the synchronous cancel version to ensure that an in-progress
 worker will not keep executing beyond whatever is happening that
 needs the cancel. E.g. suspend, driver unload, etc. However, in the
 case of a reset, the synchronous version is not required and can
 trigger a false deadlock detection warning.

 The busyness worker takes the reset mutex to protect against resets
 interfering with it. However, it does a trylock and bails out if the
 reset lock is already acquired. Thus there is no actual deadlock or
 other concern with the worker running concurrently with a reset. So
 an asynchronous cancel is safe in the case of a reset rather than a
 driver unload or suspend type operation. On the other hand, if the
 cancel_sync version is used when a reset is in progress then the
 mutex deadlock detection sees the mutex being acquired through
 multiple paths and complains.

 So just don't bother. That keeps the detection code happy and is
 safe because of the trylock code described above.

So why do we even need to cancel anything if it doesn't do anything while
the reset is in progress?

It still needs to be cancelled. The worker only aborts if it is actively
executing concurrently with the reset. It might not start to execute until
after the reset has completed. And there is presumably a reason why the
cancel is being called, a reason not necessarily related to resets at all.
Leaving the worker to run arbitrarily after the driver is expecting it to be
stopped will lead to much worse things than a fake lockdep splat, e.g. a use
after free pointer deref.

> I was actually thinking why not leave things as they are and just
> disable lockdep from CI. This doesn't look like a relevant report
> to me.
>
> Andi

Disable lockdep? The whole of lockdep? We absolutely do not want to
disable an extremely important deadlock testing infrastructure in our
test framework. That would be defeating the whole point of CI.

Potentially we could annotate this one particular scenario to suppress
this one particular error.  But it seems simpler and safer to just
update the code to not hit that scenario in the first place.

John.



Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-31 Thread Andi Shyti
Hi,

> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index a0e3ef1c65d2..600388c849f7 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
> > > >  static void guc_cancel_busyness_worker(struct intel_guc *guc)
> > > >  {
> > > > -   cancel_delayed_work_sync(&guc->timestamp.work);
> > > > +   /*
> > > > +    * When intel_gt_reset was called, task will hold a lock.
> > > > +    * To cancel delayed work here, the _sync version will also acquire a lock, which might
> > > > +    * trigger the possible circular locking dependency warning.
> > > > +    * Check the reset_in_progress flag, call async version if reset is in progress.
> > > > +    */
> > > This needs to explain in much more detail what is going on and why it is not
> > > a problem. E.g.:
> > > 
> > > The busyness worker needs to be cancelled. In general that means
> > > using the synchronous cancel version to ensure that an in-progress
> > > worker will not keep executing beyond whatever is happening that
> > > needs the cancel. E.g. suspend, driver unload, etc. However, in the
> > > case of a reset, the synchronous version is not required and can
> > > trigger a false deadlock detection warning.
> > > 
> > > The busyness worker takes the reset mutex to protect against resets
> > > interfering with it. However, it does a trylock and bails out if the
> > > reset lock is already acquired. Thus there is no actual deadlock or
> > > other concern with the worker running concurrently with a reset. So
> > > an asynchronous cancel is safe in the case of a reset rather than a
> > > driver unload or suspend type operation. On the other hand, if the
> > > cancel_sync version is used when a reset is in progress then the
> > > mutex deadlock detection sees the mutex being acquired through
> > > multiple paths and complains.
> > > 
> > > So just don't bother. That keeps the detection code happy and is
> > > safe because of the trylock code described above.
> > So why do we even need to cancel anything if it doesn't do anything while
> > the reset is in progress?
> It still needs to be cancelled. The worker only aborts if it is actively
> executing concurrently with the reset. It might not start to execute until
> after the reset has completed. And there is presumably a reason why the
> cancel is being called, a reason not necessarily related to resets at all.
> Leaving the worker to run arbitrarily after the driver is expecting it to be
> stopped will lead to much worse things than a fake lockdep splat, e.g. a use
> after free pointer deref.

I was actually thinking why not leave things as they are and just
disable lockdep from CI. This doesn't look like a relevant report
to me.

Andi


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-29 Thread Andi Shyti
Hi Zhanjun,

On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:
> On 8/11/2023 11:20, Zhanjun Dong wrote:
> 
> This attempts to avoid circular locking dependency between flush delayed
> work and intel_gt_reset.
> When intel_gt_reset was called, task will hold a lock.
> To cancel delayed work here, the _sync version will also acquire a lock,
> which might trigger the possible circular locking dependency warning.
> When intel_gt_reset called, reset_in_progress flag will be set, add code
> to check the flag, call async version if reset is in progress.

I liked the previous commit, it just needed to be wrapped (not in
the dmesg copy-paste part).

> Signed-off-by: Zhanjun Dong 
> Cc: John Harrison 
> Cc: Andi Shyti 
> Cc: Daniel Vetter 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index a0e3ef1c65d2..600388c849f7 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
> 
>  static void guc_cancel_busyness_worker(struct intel_guc *guc)
>  {
> -   cancel_delayed_work_sync(&guc->timestamp.work);
> +   /*
> +    * When intel_gt_reset was called, task will hold a lock.
> +    * To cancel delayed work here, the _sync version will also acquire a lock, which might
> +    * trigger the possible circular locking dependency warning.
> +    * Check the reset_in_progress flag, call async version if reset is in progress.
> +    */

Indeed the commit message is a bit misleading, and it raises some
alarms if explained this way.

> This needs to explain in much more detail what is going on and why it is not a
> problem. E.g.:
> 
> The busyness worker needs to be cancelled. In general that means using the
> synchronous cancel version to ensure that an in-progress worker will not
> keep executing beyond whatever is happening that needs the cancel. E.g.
> suspend, driver unload, etc. However, in the case of a reset, the
> synchronous version is not required and can trigger a false deadlock
> detection warning.
> 
> The busyness worker takes the reset mutex to protect against resets
> interfering with it. However, it does a trylock and bails out if the reset
> lock is already acquired. Thus there is no actual deadlock or other concern
> with the worker running concurrently with a reset. So an asynchronous
> cancel is safe in the case of a reset rather than a driver unload or
> suspend type operation. On the other hand, if the cancel_sync version is
> used when a reset is in progress then the mutex deadlock detection sees the
> mutex being acquired through multiple paths and complains.
> 
> So just don't bother. That keeps the detection code happy and is safe
> because of the trylock code described above.

Can you please update the commit message with John's suggestion?

Is there any further question on this?

Andi


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-28 Thread John Harrison

On 8/23/2023 10:37, John Harrison wrote:

On 8/23/2023 09:00, Daniel Vetter wrote:

On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:

On 8/11/2023 11:20, Zhanjun Dong wrote:
This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
When intel_gt_reset was called, task will hold a lock.
To cancel delayed work here, the _sync version will also acquire a lock,
which might trigger the possible circular locking dependency warning.
When intel_gt_reset called, reset_in_progress flag will be set, add code
to check the flag, call async version if reset is in progress.

Signed-off-by: Zhanjun Dong
Cc: John Harrison
Cc: Andi Shyti
Cc: Daniel Vetter
---
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
   1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..600388c849f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)

   static void guc_cancel_busyness_worker(struct intel_guc *guc)
   {
-    cancel_delayed_work_sync(&guc->timestamp.work);
+    /*
+     * When intel_gt_reset was called, task will hold a lock.
+     * To cacel delayed work here, the _sync version will also acquire a lock, which might
+     * trigger the possible cirular locking dependency warning.
+     * Check the reset_in_progress flag, call async verion if reset is in progress.
+     */
This needs to explain in much more detail what is going on and why it is not
a problem. E.g.:

    The busyness worker needs to be cancelled. In general that means
    using the synchronous cancel version to ensure that an in-progress
    worker will not keep executing beyond whatever is happening that
    needs the cancel. E.g. suspend, driver unload, etc. However, in the
    case of a reset, the synchronous version is not required and can
    trigger a false deadlock detection warning.

    The busyness worker takes the reset mutex to protect against resets
    interfering with it. However, it does a trylock and bails out if the
    reset lock is already acquired. Thus there is no actual deadlock or
    other concern with the worker running concurrently with a reset. So
    an asynchronous cancel is safe in the case of a reset rather than a
    driver unload or suspend type operation. On the other hand, if the
    cancel_sync version is used when a reset is in progress then the
    mutex deadlock detection sees the mutex being acquired through
    multiple paths and complains.

    So just don't bother. That keeps the detection code happy and is
    safe because of the trylock code described above.
So why do we even need to cancel anything if it doesn't do anything while
the reset is in progress?
It still needs to be cancelled. The worker only aborts if it is 
actively executing concurrently with the reset. It might not start to 
execute until after the reset has completed. And there is presumably a 
reason why the cancel is being called, a reason not necessarily 
related to resets at all. Leaving the worker to run arbitrarily after 
the driver is expecting it to be stopped will lead to much worse 
things than a fake lockdep splat, e.g. a use after free pointer deref.


John.
@Daniel Vetter - ping? Is this explanation sufficient? Are you okay with 
this change now?


John.





Just remove the cancel from the reset path as unneeded instead, and explain
why that's ok? Because that's de facto what the cancel_work with a
potential deadlock scenario for cancel_work_sync does, you either don't
need it at all, or the replacement creates a bug.
-Daniel



John.



+    if (guc_to_gt(guc)->uc.reset_in_progress)
+        cancel_delayed_work(&guc->timestamp.work);
+    else
+        cancel_delayed_work_sync(&guc->timestamp.work);
   }
   static void __reset_guc_busyness_stats(struct intel_guc *guc)






Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-23 Thread John Harrison

On 8/23/2023 09:00, Daniel Vetter wrote:

On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:

On 8/11/2023 11:20, Zhanjun Dong wrote:

This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
When intel_gt_reset was called, task will hold a lock.
To cacel delayed work here, the _sync version will also acquire a lock,
which might trigger the possible cirular locking dependency warning.
When intel_gt_reset called, reset_in_progress flag will be set, add code
to check the flag, call async verion if reset is in progress.

Signed-off-by: Zhanjun Dong
Cc: John Harrison
Cc: Andi Shyti
Cc: Daniel Vetter
---
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
   1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..600388c849f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
   static void guc_cancel_busyness_worker(struct intel_guc *guc)
   {
-   cancel_delayed_work_sync(&guc->timestamp.work);
+   /*
+    * When intel_gt_reset was called, task will hold a lock.
+    * To cacel delayed work here, the _sync version will also acquire a lock, which might
+    * trigger the possible cirular locking dependency warning.
+    * Check the reset_in_progress flag, call async verion if reset is in progress.
+    */

This needs to explain in much more detail what is going on and why it is not
a problem. E.g.:

The busyness worker needs to be cancelled. In general that means
using the synchronous cancel version to ensure that an in-progress
worker will not keep executing beyond whatever is happening that
needs the cancel. E.g. suspend, driver unload, etc. However, in the
case of a reset, the synchronous version is not required and can
trigger a false deadlock detection warning.

The busyness worker takes the reset mutex to protect against resets
interfering with it. However, it does a trylock and bails out if the
reset lock is already acquired. Thus there is no actual deadlock or
other concern with the worker running concurrently with a reset. So
an asynchronous cancel is safe in the case of a reset rather than a
driver unload or suspend type operation. On the other hand, if the
cancel_sync version is used when a reset is in progress then the
mutex deadlock detection sees the mutex being acquired through
multiple paths and complains.

So just don't bother. That keeps the detection code happy and is
safe because of the trylock code described above.

So why do we even need to cancel anything if it doesn't do anything while
the reset is in progress?
It still needs to be cancelled. The worker only aborts if it is actively 
executing concurrently with the reset. It might not start to execute 
until after the reset has completed. And there is presumably a reason 
why the cancel is being called, a reason not necessarily related to 
resets at all. Leaving the worker to run arbitrarily after the driver is 
expecting it to be stopped will lead to much worse things than a fake 
lockdep splat, e.g. a use after free pointer deref.
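
To make the difference between the two cancel flavours concrete, here is a toy userspace model. It is only a sketch under loud assumptions: `struct work_sketch` and the `*_sketch` functions are hypothetical names, not the kernel's `struct delayed_work` or the real `cancel_delayed_work()` / `cancel_delayed_work_sync()`, and the "wait" in the synchronous variant is modelled by simply marking the callback as finished because the toy has no second thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a delayed work item; NOT the kernel's struct delayed_work. */
struct work_sketch {
	bool queued;     /* timer armed, callback has not started yet */
	bool executing;  /* callback is currently running */
};

/*
 * Models an asynchronous cancel: a queued instance is dropped, but a
 * callback that is already executing is left alone.
 * Returns true if a queued instance was removed.
 */
bool cancel_async_sketch(struct work_sketch *w)
{
	bool was_queued = w->queued;

	w->queued = false;
	return was_queued;
}

/*
 * Models a synchronous cancel: as above, and additionally does not return
 * until the callback has finished. The wait is modelled by marking the
 * callback as done (assumption: this toy is single-threaded).
 */
bool cancel_sync_sketch(struct work_sketch *w)
{
	bool was_queued = cancel_async_sketch(w);

	w->executing = false;
	return was_queued;
}

/* The owner may only free the worker's data once nothing can touch it. */
bool safe_to_free_sketch(const struct work_sketch *w)
{
	return !w->queued && !w->executing;
}
```

In this model, a work item that is never cancelled stays queued and can fire at any later point, which is the case that matters for unload or suspend.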


John.



Just remove the cancel from the reset path as unneeded instead, and explain
why that's ok? Because that's de facto what the cancel_work with a
potential deadlock scenario for cancel_work_sync does, you either don't
need it at all, or the replacement creates a bug.
-Daniel



John.



+   if (guc_to_gt(guc)->uc.reset_in_progress)
+       cancel_delayed_work(&guc->timestamp.work);
+   else
+       cancel_delayed_work_sync(&guc->timestamp.work);
   }
   static void __reset_guc_busyness_stats(struct intel_guc *guc)




Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-23 Thread Daniel Vetter
On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:
> On 8/11/2023 11:20, Zhanjun Dong wrote:
> > This attempts to avoid circular locking dependency between flush delayed
> > work and intel_gt_reset.
> > When intel_gt_reset was called, task will hold a lock.
> > To cacel delayed work here, the _sync version will also acquire a lock,
> > which might trigger the possible cirular locking dependency warning.
> > When intel_gt_reset called, reset_in_progress flag will be set, add code
> > to check the flag, call async verion if reset is in progress.
> > 
> > Signed-off-by: Zhanjun Dong
> > Cc: John Harrison
> > Cc: Andi Shyti
> > Cc: Daniel Vetter
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
> > b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index a0e3ef1c65d2..600388c849f7 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
> >   static void guc_cancel_busyness_worker(struct intel_guc *guc)
> >   {
> > -   cancel_delayed_work_sync(&guc->timestamp.work);
> > +   /*
> > +    * When intel_gt_reset was called, task will hold a lock.
> > +    * To cacel delayed work here, the _sync version will also acquire a lock, which might
> > +    * trigger the possible cirular locking dependency warning.
> > +    * Check the reset_in_progress flag, call async verion if reset is in progress.
> > +    */
> This needs to explain in much more detail what is going on and why it is not
> a problem. E.g.:
> 
>    The busyness worker needs to be cancelled. In general that means
>    using the synchronous cancel version to ensure that an in-progress
>    worker will not keep executing beyond whatever is happening that
>    needs the cancel. E.g. suspend, driver unload, etc. However, in the
>    case of a reset, the synchronous version is not required and can
>    trigger a false deadlock detection warning.
> 
>    The busyness worker takes the reset mutex to protect against resets
>    interfering with it. However, it does a trylock and bails out if the
>    reset lock is already acquired. Thus there is no actual deadlock or
>    other concern with the worker running concurrently with a reset. So
>    an asynchronous cancel is safe in the case of a reset rather than a
>    driver unload or suspend type operation. On the other hand, if the
>    cancel_sync version is used when a reset is in progress then the
>    mutex deadlock detection sees the mutex being acquired through
>    multiple paths and complains.
> 
>    So just don't bother. That keeps the detection code happy and is
>    safe because of the trylock code described above.

So why do we even need to cancel anything if it doesn't do anything while
the reset is in progress?

Just remove the cancel from the reset path as unneeded instead, and explain
why that's ok? Because that's de facto what the cancel_work with a
potential deadlock scenario for cancel_work_sync does, you either don't
need it at all, or the replacement creates a bug.
-Daniel

> 
> 
> John.
> 
> 
> > +   if (guc_to_gt(guc)->uc.reset_in_progress)
> > +       cancel_delayed_work(&guc->timestamp.work);
> > +   else
> > +       cancel_delayed_work_sync(&guc->timestamp.work);
> >   }
> >   static void __reset_guc_busyness_stats(struct intel_guc *guc)

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-22 Thread John Harrison

On 8/11/2023 11:20, Zhanjun Dong wrote:

This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
When intel_gt_reset was called, task will hold a lock.
To cacel delayed work here, the _sync version will also acquire a lock,
which might trigger the possible cirular locking dependency warning.
When intel_gt_reset called, reset_in_progress flag will be set, add code
to check the flag, call async verion if reset is in progress.

Signed-off-by: Zhanjun Dong
Cc: John Harrison
Cc: Andi Shyti
Cc: Daniel Vetter
---
  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..600388c849f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
  
  static void guc_cancel_busyness_worker(struct intel_guc *guc)
  {
-   cancel_delayed_work_sync(&guc->timestamp.work);
+   /*
+    * When intel_gt_reset was called, task will hold a lock.
+    * To cacel delayed work here, the _sync version will also acquire a lock, which might
+    * trigger the possible cirular locking dependency warning.
+    * Check the reset_in_progress flag, call async verion if reset is in progress.
+    */
This needs to explain in much more detail what is going on and why it is 
not a problem. E.g.:


   The busyness worker needs to be cancelled. In general that means
   using the synchronous cancel version to ensure that an in-progress
   worker will not keep executing beyond whatever is happening that
   needs the cancel. E.g. suspend, driver unload, etc. However, in the
   case of a reset, the synchronous version is not required and can
   trigger a false deadlock detection warning.

   The busyness worker takes the reset mutex to protect against resets
   interfering with it. However, it does a trylock and bails out if the
   reset lock is already acquired. Thus there is no actual deadlock or
   other concern with the worker running concurrently with a reset. So
   an asynchronous cancel is safe in the case of a reset rather than a
   driver unload or suspend type operation. On the other hand, if the
   cancel_sync version is used when a reset is in progress then the
   mutex deadlock detection sees the mutex being acquired through
   multiple paths and complains.

   So just don't bother. That keeps the detection code happy and is
   safe because of the trylock code described above.
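
The trylock-and-bail behaviour described above can be sketched in plain userspace C. This is only an illustration, not the i915 code: `reset_lock`, `samples_taken` and the `*_sketch` functions are hypothetical names, the reset mutex is stood in for by a C11 `atomic_flag`, and the worker is called from the same thread as the "reset" purely to keep the example self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reset mutex; NOT the i915 structure. */
static atomic_flag reset_lock = ATOMIC_FLAG_INIT;
static int samples_taken;

/* Returns true if the lock was free and is now held by the caller. */
static bool reset_trylock_sketch(void)
{
	return !atomic_flag_test_and_set(&reset_lock);
}

static void reset_unlock_sketch(void)
{
	atomic_flag_clear(&reset_lock);
}

/*
 * Models the worker body: try to take the reset lock and bail out if a
 * reset already holds it, so the worker never blocks behind a reset.
 * Returns true if the worker did its sampling.
 */
bool busyness_worker_sketch(void)
{
	if (!reset_trylock_sketch())
		return false;	/* reset in progress: skip this round */

	samples_taken++;	/* placeholder for the real stats update */
	reset_unlock_sketch();
	return true;
}

/*
 * Models a reset that happens to overlap with the worker: the worker is
 * invoked while the reset lock is held. Returns whether the worker ran.
 */
bool worker_during_reset_sketch(void)
{
	bool ran;

	while (!reset_trylock_sketch())
		;	/* the reset path itself waits for the lock */
	ran = busyness_worker_sketch();
	reset_unlock_sketch();
	return ran;
}
```

Because the worker never waits for the lock, it cannot be the party that completes a wait cycle; it simply skips the round when a reset holds the lock.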


John.



+   if (guc_to_gt(guc)->uc.reset_in_progress)
+       cancel_delayed_work(&guc->timestamp.work);
+   else
+       cancel_delayed_work_sync(&guc->timestamp.work);
  }
  
  static void __reset_guc_busyness_stats(struct intel_guc *guc)


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-22 Thread Daniel Vetter
On Tue, Aug 22, 2023 at 02:14:28PM +, Dong, Zhanjun wrote:
> 
> 
> > -Original Message-
> > From: Daniel Vetter 
> > Sent: August 22, 2023 9:51 AM
> > To: Dong, Zhanjun 
> > Cc: intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
> > Harrison,
> > John C ; Andi Shyti ;
> > Daniel Vetter 
> > Subject: Re: [PATCH v5] drm/i915: Avoid circular locking dependency when
> > flush delayed work on gt reset
> > 
> > On Fri, Aug 11, 2023 at 11:20:11AM -0700, Zhanjun Dong wrote:
> > > This attempts to avoid circular locking dependency between flush delayed
> > > work and intel_gt_reset.
> > > When intel_gt_reset was called, task will hold a lock.
> > > To cacel delayed work here, the _sync version will also acquire a lock,
> > > which might trigger the possible cirular locking dependency warning.
> > > When intel_gt_reset called, reset_in_progress flag will be set, add code
> > > to check the flag, call async verion if reset is in progress.
> > >
> > > Signed-off-by: Zhanjun Dong 
> > > Cc: John Harrison 
> > > Cc: Andi Shyti 
> > > Cc: Daniel Vetter 
> > > ---
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
> > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index a0e3ef1c65d2..600388c849f7 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct
> > intel_guc *guc)
> > >
> > >  static void guc_cancel_busyness_worker(struct intel_guc *guc)
> > >  {
> > > - cancel_delayed_work_sync(>timestamp.work);
> > > + /*
> > > +  * When intel_gt_reset was called, task will hold a lock.
> > > +  * To cacel delayed work here, the _sync version will also acquire a 
> > > lock,
> > which might
> > > +  * trigger the possible cirular locking dependency warning.
> > 
> > This is not even close to a locking bugfix. Consider this a formal nack,
> > because the issue here is not even close to "needs more comments to
> > explain what's going on".
> > -Daniel
> 
> The purpose of the comment here it is to explain locking issue condition
> > 
> > > +  * Check the reset_in_progress flag, call async verion if reset is in
> > progress.
> 
> 
> The comment here explains check with the flag to avoid locking condition.
> The reset process is not considered to be complete in short time, other than 
> that, do we missed anything?

Either the _sync is not needed at all, in which case you need to explain
why, which this patch doesn't. And if the _sync isn't needed, then it's
probably not needed in all/most cases?

Or the _sync is needed, and in that case you just replace a potential
deadlock scenario with a potential race condition.

In neither case should this patch here be merged.
-Daniel

> 
> > > +  */
> > > + if (guc_to_gt(guc)->uc.reset_in_progress)
> > > + cancel_delayed_work(>timestamp.work);
> > > + else
> > > + cancel_delayed_work_sync(>timestamp.work);
> > >  }
> > >
> > >  static void __reset_guc_busyness_stats(struct intel_guc *guc)
> > > --
> > > 2.34.1
> > >
> > 
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-22 Thread Dong, Zhanjun


> -Original Message-
> From: Daniel Vetter 
> Sent: August 22, 2023 9:51 AM
> To: Dong, Zhanjun 
> Cc: intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
> Harrison,
> John C ; Andi Shyti ;
> Daniel Vetter 
> Subject: Re: [PATCH v5] drm/i915: Avoid circular locking dependency when
> flush delayed work on gt reset
> 
> On Fri, Aug 11, 2023 at 11:20:11AM -0700, Zhanjun Dong wrote:
> > This attempts to avoid circular locking dependency between flush delayed
> > work and intel_gt_reset.
> > When intel_gt_reset was called, task will hold a lock.
> > To cacel delayed work here, the _sync version will also acquire a lock,
> > which might trigger the possible cirular locking dependency warning.
> > When intel_gt_reset called, reset_in_progress flag will be set, add code
> > to check the flag, call async verion if reset is in progress.
> >
> > Signed-off-by: Zhanjun Dong 
> > Cc: John Harrison 
> > Cc: Andi Shyti 
> > Cc: Daniel Vetter 
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index a0e3ef1c65d2..600388c849f7 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct
> intel_guc *guc)
> >
> >  static void guc_cancel_busyness_worker(struct intel_guc *guc)
> >  {
> > -   cancel_delayed_work_sync(&guc->timestamp.work);
> > +   /*
> > +* When intel_gt_reset was called, task will hold a lock.
> > +* To cacel delayed work here, the _sync version will also acquire a 
> > lock,
> which might
> > +* trigger the possible cirular locking dependency warning.
> 
> This is not even close to a locking bugfix. Consider this a formal nack,
> because the issue here is not even close to "needs more comments to
> explain what's going on".
> -Daniel

The purpose of the comment here is to explain the condition that causes the locking issue.
> 
> > +* Check the reset_in_progress flag, call async verion if reset is in
> progress.


The comment here explains the check on the flag to avoid the locking condition.
The reset process is not expected to complete in a short time. Other than
that, did we miss anything?

> > +*/
> > +   if (guc_to_gt(guc)->uc.reset_in_progress)
> > +       cancel_delayed_work(&guc->timestamp.work);
> > +   else
> > +       cancel_delayed_work_sync(&guc->timestamp.work);
> >  }
> >
> >  static void __reset_guc_busyness_stats(struct intel_guc *guc)
> > --
> > 2.34.1
> >
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-22 Thread Daniel Vetter
On Fri, Aug 11, 2023 at 11:20:11AM -0700, Zhanjun Dong wrote:
> This attempts to avoid circular locking dependency between flush delayed
> work and intel_gt_reset.
> When intel_gt_reset was called, task will hold a lock.
> To cacel delayed work here, the _sync version will also acquire a lock,
> which might trigger the possible cirular locking dependency warning.
> When intel_gt_reset called, reset_in_progress flag will be set, add code
> to check the flag, call async verion if reset is in progress.
> 
> Signed-off-by: Zhanjun Dong 
> Cc: John Harrison 
> Cc: Andi Shyti 
> Cc: Daniel Vetter 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index a0e3ef1c65d2..600388c849f7 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
>  
>  static void guc_cancel_busyness_worker(struct intel_guc *guc)
>  {
> - cancel_delayed_work_sync(&guc->timestamp.work);
> + /*
> +  * When intel_gt_reset was called, task will hold a lock.
> +  * To cacel delayed work here, the _sync version will also acquire a lock, which might
> +  * trigger the possible cirular locking dependency warning.

This is not even close to a locking bugfix. Consider this a formal nack,
because the issue here is not even close to "needs more comments to
explain what's going on".
-Daniel

> +  * Check the reset_in_progress flag, call async verion if reset is in progress.
> +  */
> + if (guc_to_gt(guc)->uc.reset_in_progress)
> +     cancel_delayed_work(&guc->timestamp.work);
> + else
> +     cancel_delayed_work_sync(&guc->timestamp.work);
>  }
>  
>  static void __reset_guc_busyness_stats(struct intel_guc *guc)
> -- 
> 2.34.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-21 Thread Andi Shyti
Hi Zhanjun,

On Fri, Aug 11, 2023 at 11:20:11AM -0700, Zhanjun Dong wrote:
> This attempts to avoid circular locking dependency between flush delayed
> work and intel_gt_reset.
> When intel_gt_reset was called, task will hold a lock.
> To cacel delayed work here, the _sync version will also acquire a lock,

/cacel/cancel/

> which might trigger the possible cirular locking dependency warning.
> When intel_gt_reset called, reset_in_progress flag will be set, add code
> to check the flag, call async verion if reset is in progress.

/verion/version/

> Signed-off-by: Zhanjun Dong 
> Cc: John Harrison 
> Cc: Andi Shyti 
> Cc: Daniel Vetter 
> ---

There is no changelog here :/

Can you please add the changelog after the '---' section?

The commit log has changed and...

>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index a0e3ef1c65d2..600388c849f7 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
>  
>  static void guc_cancel_busyness_worker(struct intel_guc *guc)
>  {
> - cancel_delayed_work_sync(&guc->timestamp.work);
> + /*
> +  * When intel_gt_reset was called, task will hold a lock.
> +  * To cacel delayed work here, the _sync version will also acquire a lock, which might
> +  * trigger the possible cirular locking dependency warning.
> +  * Check the reset_in_progress flag, call async verion if reset is in progress.
> +  */
> + if (guc_to_gt(guc)->uc.reset_in_progress)
> +     cancel_delayed_work(&guc->timestamp.work);
> + else
> +     cancel_delayed_work_sync(&guc->timestamp.work);

... now you are checking out of reset_in_progress.

Normally the convention here is to have the *_locked() version of
the function. But I'm OK with this, as well... John, any opinion?
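
For readers unfamiliar with the `*_locked()` naming convention mentioned here, a minimal userspace sketch follows. All names (`struct stats_sketch`, `stats_cancel_worker()`, `stats_cancel_worker_locked()`) are hypothetical and are not i915 functions, and a C11 `atomic_flag` stands in for whatever lock the real code would use. The idea is that the plain function takes the lock itself, while the `_locked` variant assumes the caller already holds it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state guarded by a lock; NOT an i915 structure. */
struct stats_sketch {
	atomic_flag lock;
	bool worker_armed;
};

/* Caller must already hold s->lock. Does no locking of its own. */
void stats_cancel_worker_locked(struct stats_sketch *s)
{
	s->worker_armed = false;
}

/* Takes the lock itself, then defers to the _locked variant. */
void stats_cancel_worker(struct stats_sketch *s)
{
	while (atomic_flag_test_and_set(&s->lock))
		;	/* spin until the lock is free */
	stats_cancel_worker_locked(s);
	atomic_flag_clear(&s->lock);
}
```

A path that already holds the lock calls the `_locked` variant directly, so the choice is made by the call site rather than by a runtime flag check inside the function.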

Anyway, comparing with your previous patch the decision is made
out of different elements and only __reset_guc_busyness_stats()
needed this change.

Andi

>  }
>  
>  static void __reset_guc_busyness_stats(struct intel_guc *guc)
> -- 
> 2.34.1


[Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-11 Thread Zhanjun Dong
This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
When intel_gt_reset was called, task will hold a lock.
To cacel delayed work here, the _sync version will also acquire a lock,
which might trigger the possible cirular locking dependency warning.
When intel_gt_reset called, reset_in_progress flag will be set, add code
to check the flag, call async verion if reset is in progress.

Signed-off-by: Zhanjun Dong 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Daniel Vetter 
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..600388c849f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
 
 static void guc_cancel_busyness_worker(struct intel_guc *guc)
 {
-   cancel_delayed_work_sync(&guc->timestamp.work);
+   /*
+    * When intel_gt_reset was called, task will hold a lock.
+    * To cacel delayed work here, the _sync version will also acquire a lock, which might
+    * trigger the possible cirular locking dependency warning.
+    * Check the reset_in_progress flag, call async verion if reset is in progress.
+    */
+   if (guc_to_gt(guc)->uc.reset_in_progress)
+       cancel_delayed_work(&guc->timestamp.work);
+   else
+       cancel_delayed_work_sync(&guc->timestamp.work);
 }
 
 static void __reset_guc_busyness_stats(struct intel_guc *guc)
-- 
2.34.1