Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread Matt Roper
On Wed, Oct 04, 2023 at 10:58:32PM +0200, Andi Shyti wrote:
> Hi Matt,
> 
> > > > > > > The MCR steering semaphore is a shared lock entry between i915
> > > > > > > and various firmware components.
> > > > > > > 
> > > > > > > Getting the lock might sinchronize on some shared resources.
> > > > > > > Sometimes though, it might happen that the firmware forgets to
> > > > > > > unlock causing unnecessary noise in the driver which keeps doing
> > > > > > > what was supposed to do, ignoring the problem.
> > > > > > > 
> > > > > > > Do not consider this failure as an error, but just print a debug
> > > > > > > message stating that the MCR locking has been skipped.
> > > > > > > 
> > > > > > > On the driver side we still have spinlocks that make sure that
> > > > > > > the access to the resources is serialized.
> > > > > > > 
> > > > > > > Signed-off-by: Andi Shyti 
> > > > > > > Cc: Jonathan Cavitt 
> > > > > > > Cc: Matt Roper 
> > > > > > > Cc: Nirmoy Das 
> > > > > > > ---
> > > > > > >drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
> > > > > > >1 file changed, 2 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
> > > > > > > b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > > index 326c2ed1d99b..51eb693df39b 100644
> > > > > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > > @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, 
> > > > > > > unsigned long *flags)
> > > > > > >* would indicate some hardware/firmware is misbehaving 
> > > > > > > and not
> > > > > > >* releasing it properly.
> > > > > > >*/
> > > > > > > - if (err == -ETIMEDOUT) {
> > > > > > > - gt_err_ratelimited(gt, "hardware MCR steering semaphore 
> > > > > > > timed out");
> > > > > > > - add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
> > > > > > > unreliable */
> > > > > > > - }
> > > > > > > + if (err == -ETIMEDOUT)
> > > > > > > + gt_dbg(gt, "hardware MCR steering semaphore timed out");
> > > > > > >}
> > > > > > >/**
> > > > > > Are we sure this does not warrant a level higher than dbg, such as
> > > > > > notice/warn?
> > > > > We might make it a warn, but this doesn't change much the economy
> > > > > of the driver as we will keep doing what we were supposed to do.
> > > > > 
> > > > > > Because how can we be sure the two entities will not stomp on
> > > > > > each other toes if we failed to obtain lock?
> > > > > So far, in all the research I've done, no one looks like using
> > > > > MCR lock, but yet someone is stuck in it.
> > > > 
> > > > If someone has the lock then that someone thinks they are using it. Just
> > > > because you can't see what someone piece of IFWI is doing doesn't mean 
> > > > it
> > > > isn't doing it. And if it is a genuinely missing unlock then it needs 
> > > > to be
> > > > tracked down and fixed with an IFWI update otherwise the system is 
> > > > going to
> > > > be unstable from that point on.
> > > 
> > > But I'm not changing here the behavior of the driver. The driver
> > > will keep doing what was doing before.
> > > 
> > > Because this most probably won't be noticed by the user, then I
> > > don't see why it should shout out loud that the system is
> > > unusable when most probably it is.
> > 
> > That's like saying that any random race condition isn't likely to be
> > noticed by the user so it's not a big deal if we're missing a few
> > mutexes or spinlocks somewhere...even though there's likely to be no
> > user-visible impact to any race condition 99% of the time, it's the 1%
> > that winds up being absolutely catastrophic.
> 
> Not really... normally if you hit a spinlock/mutex race
> condition, you end up in a deadlock and stall the system. In this
> case, I agree that the lock should be sorted out by the hardware,
> but in the meantime i915 is *already* ignoring it.
> 
> I'm not making any behavioral change with this patch.
> 
> What I'm trying to say is that if the system doesn't crash, then
> let it go... don't crash it on purpose just because there is a
> locking situation and we think it has a catastrophic effect, but
> in reality is not producing anything (like practically in our
> case, you can check the CI results[*]).
> 
> [*] https://patchwork.freedesktop.org/patch/560733/?series=124599=1

But hiding the error message is the opposite of the direction we should
be moving.  If anything we need to be escalating this harder, for
example by wedging the GT to truly prevent if from being used further.
We obviously don't want a BUG() or something that would crash the whole
system and potentially cause non-graphics data loss, but we really
shouldn't let regular operation keep going.  The goal of ignoring the
the semaphore and moving forward is that we keep the system alive for a
developer to gather more debug information.

If this is ever seen in the wild by 

Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread John Harrison

On 10/4/2023 13:58, Andi Shyti wrote:

Hi Matt,


The MCR steering semaphore is a shared lock entry between i915
and various firmware components.

Getting the lock might sinchronize on some shared resources.
Sometimes though, it might happen that the firmware forgets to
unlock causing unnecessary noise in the driver which keeps doing
what was supposed to do, ignoring the problem.

Do not consider this failure as an error, but just print a debug
message stating that the MCR locking has been skipped.

On the driver side we still have spinlocks that make sure that
the access to the resources is serialized.

Signed-off-by: Andi Shyti 
Cc: Jonathan Cavitt 
Cc: Matt Roper 
Cc: Nirmoy Das 
---
drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index 326c2ed1d99b..51eb693df39b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long 
*flags)
 * would indicate some hardware/firmware is misbehaving and not
 * releasing it properly.
 */
-   if (err == -ETIMEDOUT) {
-   gt_err_ratelimited(gt, "hardware MCR steering semaphore timed 
out");
-   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
unreliable */
-   }
+   if (err == -ETIMEDOUT)
+   gt_dbg(gt, "hardware MCR steering semaphore timed out");
}
/**

Are we sure this does not warrant a level higher than dbg, such as
notice/warn?

We might make it a warn, but this doesn't change much the economy
of the driver as we will keep doing what we were supposed to do.


Because how can we be sure the two entities will not stomp on
each other toes if we failed to obtain lock?

So far, in all the research I've done, no one looks like using
MCR lock, but yet someone is stuck in it.

If someone has the lock then that someone thinks they are using it. Just
because you can't see what someone piece of IFWI is doing doesn't mean it
isn't doing it. And if it is a genuinely missing unlock then it needs to be
tracked down and fixed with an IFWI update otherwise the system is going to
be unstable from that point on.

But I'm not changing here the behavior of the driver. The driver
will keep doing what was doing before.

Because this most probably won't be noticed by the user, then I
don't see why it should shout out loud that the system is
unusable when most probably it is.

That's like saying that any random race condition isn't likely to be
noticed by the user so it's not a big deal if we're missing a few
mutexes or spinlocks somewhere...even though there's likely to be no
user-visible impact to any race condition 99% of the time, it's the 1%
that winds up being absolutely catastrophic.

Not really... normally if you hit a spinlock/mutex race
condition, you end up in a deadlock and stall the system. In this
case, I agree that the lock should be sorted out by the hardware,
but in the meantime i915 is *already* ignoring it.
Um, "a deadlock and stall the system" is exactly what is happening here. 
To prevent a total hang, we are ignoring the deadlock and proceeding 
anyway. Essentially moving to the situation of having a critical section 
which is not protected by the mutex at all.  No matter how you phrase 
it, that is a critical section failure and you do not know how the 1% 
failure might manifest.




I'm not making any behavioral change with this patch.

What I'm trying to say is that if the system doesn't crash, then
let it go... don't crash it on purpose just because there is a
locking situation and we think it has a catastrophic effect, but
in reality is not producing anything (like practically in our
case, you can check the CI results[*]).
We are not going to 'crash it on purpose'. We are printing out an error 
message to say that an error has occurred. Which it has. And as above, 
just because you haven't noticed hitting a catastrophic race condition 
yet doesn't mean that it isn't there.


John.



[*] https://patchwork.freedesktop.org/patch/560733/?series=124599=1


If we're not obtaining the MCR lock as expected and are simply moving
forward to force our own steering (possibly at the same time firmware is
programming steering to a different value), you probably won't actually
see a problem either because the operations won't wind up interleaving
in a problematic order, or because the driver and the firmware both
happen to be trying to steer to the same instance (e.g., instance #0 is
a quite common target).  But even if they're hard to hit, the
possibility for a major problem is still there and basically we need to
consider the whole platform to be in an unknown, unstable state once
we've detected one of these issues.

Based on some earlier experiments, it sounds like the problem at the
moment is that we've just been too 

Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread John Harrison

On 10/4/2023 13:09, Andi Shyti wrote:

Hi John,


The MCR steering semaphore is a shared lock entry between i915
and various firmware components.

Getting the lock might sinchronize on some shared resources.
Sometimes though, it might happen that the firmware forgets to
unlock causing unnecessary noise in the driver which keeps doing
what was supposed to do, ignoring the problem.

Do not consider this failure as an error, but just print a debug
message stating that the MCR locking has been skipped.

On the driver side we still have spinlocks that make sure that
the access to the resources is serialized.

Signed-off-by: Andi Shyti 
Cc: Jonathan Cavitt 
Cc: Matt Roper 
Cc: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index 326c2ed1d99b..51eb693df39b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long 
*flags)
 * would indicate some hardware/firmware is misbehaving and not
 * releasing it properly.
 */
-   if (err == -ETIMEDOUT) {
-   gt_err_ratelimited(gt, "hardware MCR steering semaphore timed 
out");
-   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
unreliable */
-   }
+   if (err == -ETIMEDOUT)
+   gt_dbg(gt, "hardware MCR steering semaphore timed out");
 }
 /**

Are we sure this does not warrant a level higher than dbg, such as
notice/warn?

We might make it a warn, but this doesn't change much the economy
of the driver as we will keep doing what we were supposed to do.


Because how can we be sure the two entities will not stomp on
each other toes if we failed to obtain lock?

So far, in all the research I've done, no one looks like using
MCR lock, but yet someone is stuck in it.

If someone has the lock then that someone thinks they are using it. Just
because you can't see what someone piece of IFWI is doing doesn't mean it
isn't doing it. And if it is a genuinely missing unlock then it needs to be
tracked down and fixed with an IFWI update otherwise the system is going to
be unstable from that point on.

But I'm not changing here the behavior of the driver. The driver
will keep doing what was doing before.

And what it is doing is dangerous and can lead to unpredictable results
because a critical section resource access is no longer wrapped with a
critical section lock. Hence there is an error message to say Here Be
Dragons.

hehe!

What are you suggesting, then? Should we reset the GT after
hitting the MCR lock issue?
No. I'm suggesting that you don't hide the issue by removing the error 
message. It is there for a reason. Just because that reason is being hit 
doesn't mean you should remove the message.




We could, but I rather prefer to work with what's available.


Because this most probably won't be noticed by the user, then I
don't see why it should shout out loud that the system is
unusable when most probably it is.

Just because a race condition is hard to hit doesn't mean it won't be hit.

We are hitting it, constantly, but it's not producing any effect.
Even when raising the MCR wait timeout. Which means that most
probably someone is having a nap on the lock.
No. You are hitting the lock contention problem constantly and so far 
are not seeing any *visible* effect.


The point is that there is still a potential race condition which you 
haven't hit yet which could lead to data corruption, page faults, 
crashes, etc. (because a TLB invalidation access went to the wrong 
target) or the CPU/GPU melting itself of the board (because a power 
management access went to the wrong target).





The point of shouting out loud is that we know for a fact a problem has
occurred. That problem might lead to nothing or it might lead to subtle data
corruption, gross crashes or anything in between.

yes, agree... the message still stays, though, with a bit of a
lower catastrophy.


BTW, at my understanding this is not an IFWI problem. We checked
with different version of IFWI and the issue is the same.

Which implies it is a driver bug after all? In which case you absolutely
should not be papering over it in the driver.

This section is serialized by a spinlock and besides I haven't
found any place else except for the TLB invalidation and the
resume where we can incur a locking situation.
There is a bug somewhere. Either it is IFWI or it is KMD. You can't say 
"I can't find a problem therefore there is no problem".


John.




Thanks a lot for your inputs and discussion!
Andi




Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread Andi Shyti
Hi Matt,

> > > > > > The MCR steering semaphore is a shared lock entry between i915
> > > > > > and various firmware components.
> > > > > > 
> > > > > > Getting the lock might sinchronize on some shared resources.
> > > > > > Sometimes though, it might happen that the firmware forgets to
> > > > > > unlock causing unnecessary noise in the driver which keeps doing
> > > > > > what was supposed to do, ignoring the problem.
> > > > > > 
> > > > > > Do not consider this failure as an error, but just print a debug
> > > > > > message stating that the MCR locking has been skipped.
> > > > > > 
> > > > > > On the driver side we still have spinlocks that make sure that
> > > > > > the access to the resources is serialized.
> > > > > > 
> > > > > > Signed-off-by: Andi Shyti 
> > > > > > Cc: Jonathan Cavitt 
> > > > > > Cc: Matt Roper 
> > > > > > Cc: Nirmoy Das 
> > > > > > ---
> > > > > >drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
> > > > > >1 file changed, 2 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
> > > > > > b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > index 326c2ed1d99b..51eb693df39b 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, 
> > > > > > unsigned long *flags)
> > > > > >  * would indicate some hardware/firmware is misbehaving and not
> > > > > >  * releasing it properly.
> > > > > >  */
> > > > > > -   if (err == -ETIMEDOUT) {
> > > > > > -   gt_err_ratelimited(gt, "hardware MCR steering semaphore 
> > > > > > timed out");
> > > > > > -   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
> > > > > > unreliable */
> > > > > > -   }
> > > > > > +   if (err == -ETIMEDOUT)
> > > > > > +   gt_dbg(gt, "hardware MCR steering semaphore timed out");
> > > > > >}
> > > > > >/**
> > > > > Are we sure this does not warrant a level higher than dbg, such as
> > > > > notice/warn?
> > > > We might make it a warn, but this doesn't change much the economy
> > > > of the driver as we will keep doing what we were supposed to do.
> > > > 
> > > > > Because how can we be sure the two entities will not stomp on
> > > > > each other toes if we failed to obtain lock?
> > > > So far, in all the research I've done, no one looks like using
> > > > MCR lock, but yet someone is stuck in it.
> > > 
> > > If someone has the lock then that someone thinks they are using it. Just
> > > because you can't see what someone piece of IFWI is doing doesn't mean it
> > > isn't doing it. And if it is a genuinely missing unlock then it needs to 
> > > be
> > > tracked down and fixed with an IFWI update otherwise the system is going 
> > > to
> > > be unstable from that point on.
> > 
> > But I'm not changing here the behavior of the driver. The driver
> > will keep doing what was doing before.
> > 
> > Because this most probably won't be noticed by the user, then I
> > don't see why it should shout out loud that the system is
> > unusable when most probably it is.
> 
> That's like saying that any random race condition isn't likely to be
> noticed by the user so it's not a big deal if we're missing a few
> mutexes or spinlocks somewhere...even though there's likely to be no
> user-visible impact to any race condition 99% of the time, it's the 1%
> that winds up being absolutely catastrophic.

Not really... normally if you hit a spinlock/mutex race
condition, you end up in a deadlock and stall the system. In this
case, I agree that the lock should be sorted out by the hardware,
but in the meantime i915 is *already* ignoring it.

I'm not making any behavioral change with this patch.

What I'm trying to say is that if the system doesn't crash, then
let it go... don't crash it on purpose just because there is a
locking situation and we think it has a catastrophic effect, but
in reality is not producing anything (like practically in our
case, you can check the CI results[*]).

[*] https://patchwork.freedesktop.org/patch/560733/?series=124599=1

> If we're not obtaining the MCR lock as expected and are simply moving
> forward to force our own steering (possibly at the same time firmware is
> programming steering to a different value), you probably won't actually
> see a problem either because the operations won't wind up interleaving
> in a problematic order, or because the driver and the firmware both
> happen to be trying to steer to the same instance (e.g., instance #0 is
> a quite common target).  But even if they're hard to hit, the
> possibility for a major problem is still there and basically we need to
> consider the whole platform to be in an unknown, unstable state once
> we've detected one of these issues.
> 
> Based on some earlier experiments, it sounds like the problem at the
> moment is that we've just been too hasty with deciding that the lock is
> 

Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread Andi Shyti
Hi John,

> > > > > > The MCR steering semaphore is a shared lock entry between i915
> > > > > > and various firmware components.
> > > > > > 
> > > > > > Getting the lock might sinchronize on some shared resources.
> > > > > > Sometimes though, it might happen that the firmware forgets to
> > > > > > unlock causing unnecessary noise in the driver which keeps doing
> > > > > > what was supposed to do, ignoring the problem.
> > > > > > 
> > > > > > Do not consider this failure as an error, but just print a debug
> > > > > > message stating that the MCR locking has been skipped.
> > > > > > 
> > > > > > On the driver side we still have spinlocks that make sure that
> > > > > > the access to the resources is serialized.
> > > > > > 
> > > > > > Signed-off-by: Andi Shyti 
> > > > > > Cc: Jonathan Cavitt 
> > > > > > Cc: Matt Roper 
> > > > > > Cc: Nirmoy Das 
> > > > > > ---
> > > > > > drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
> > > > > > 1 file changed, 2 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
> > > > > > b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > index 326c2ed1d99b..51eb693df39b 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > > @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, 
> > > > > > unsigned long *flags)
> > > > > >  * would indicate some hardware/firmware is misbehaving 
> > > > > > and not
> > > > > >  * releasing it properly.
> > > > > >  */
> > > > > > -   if (err == -ETIMEDOUT) {
> > > > > > -   gt_err_ratelimited(gt, "hardware MCR steering semaphore 
> > > > > > timed out");
> > > > > > -   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
> > > > > > unreliable */
> > > > > > -   }
> > > > > > +   if (err == -ETIMEDOUT)
> > > > > > +   gt_dbg(gt, "hardware MCR steering semaphore timed out");
> > > > > > }
> > > > > > /**
> > > > > Are we sure this does not warrant a level higher than dbg, such as
> > > > > notice/warn?
> > > > We might make it a warn, but this doesn't change much the economy
> > > > of the driver as we will keep doing what we were supposed to do.
> > > > 
> > > > > Because how can we be sure the two entities will not stomp on
> > > > > each other toes if we failed to obtain lock?
> > > > So far, in all the research I've done, no one looks like using
> > > > MCR lock, but yet someone is stuck in it.
> > > If someone has the lock then that someone thinks they are using it. Just
> > > because you can't see what someone piece of IFWI is doing doesn't mean it
> > > isn't doing it. And if it is a genuinely missing unlock then it needs to 
> > > be
> > > tracked down and fixed with an IFWI update otherwise the system is going 
> > > to
> > > be unstable from that point on.
> > But I'm not changing here the behavior of the driver. The driver
> > will keep doing what was doing before.
> And what it is doing is dangerous and can lead to unpredictable results
> because a critical section resource access is no longer wrapped with a
> critical section lock. Hence there is an error message to say Here Be
> Dragons.

hehe!

What are you suggesting, then? Should we reset the GT after
hitting the MCR lock issue?

We could, but I rather prefer to work with what's available.

> > Because this most probably won't be noticed by the user, then I
> > don't see why it should shout out loud that the system is
> > unusable when most probably it is.
> Just because a race condition is hard to hit doesn't mean it won't be hit.

We are hitting it, constantly, but it's not producing any effect.
Even when raising the MCR wait timeout. Which means that most
probably someone is having a nap on the lock.

> The point of shouting out loud is that we know for a fact a problem has
> occurred. That problem might lead to nothing or it might lead to subtle data
> corruption, gross crashes or anything in between.

yes, agree... the message still stays, though, with a bit of a
lower catastrophy.

> > BTW, at my understanding this is not an IFWI problem. We checked
> > with different version of IFWI and the issue is the same.
> Which implies it is a driver bug after all? In which case you absolutely
> should not be papering over it in the driver.

This section is serialized by a spinlock and besides I haven't
found any place else except for the TLB invalidation and the
resume where we can incur a locking situation.

Thanks a lot for your inputs and discussion!
Andi


Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread Matt Roper
On Wed, Oct 04, 2023 at 09:35:27PM +0200, Andi Shyti wrote:
> Hi John,
> 
> > > > > The MCR steering semaphore is a shared lock entry between i915
> > > > > and various firmware components.
> > > > > 
> > > > > Getting the lock might sinchronize on some shared resources.
> > > > > Sometimes though, it might happen that the firmware forgets to
> > > > > unlock causing unnecessary noise in the driver which keeps doing
> > > > > what was supposed to do, ignoring the problem.
> > > > > 
> > > > > Do not consider this failure as an error, but just print a debug
> > > > > message stating that the MCR locking has been skipped.
> > > > > 
> > > > > On the driver side we still have spinlocks that make sure that
> > > > > the access to the resources is serialized.
> > > > > 
> > > > > Signed-off-by: Andi Shyti 
> > > > > Cc: Jonathan Cavitt 
> > > > > Cc: Matt Roper 
> > > > > Cc: Nirmoy Das 
> > > > > ---
> > > > >drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
> > > > >1 file changed, 2 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
> > > > > b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > index 326c2ed1d99b..51eb693df39b 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > > @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, 
> > > > > unsigned long *flags)
> > > > >* would indicate some hardware/firmware is misbehaving and not
> > > > >* releasing it properly.
> > > > >*/
> > > > > - if (err == -ETIMEDOUT) {
> > > > > - gt_err_ratelimited(gt, "hardware MCR steering semaphore 
> > > > > timed out");
> > > > > - add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
> > > > > unreliable */
> > > > > - }
> > > > > + if (err == -ETIMEDOUT)
> > > > > + gt_dbg(gt, "hardware MCR steering semaphore timed out");
> > > > >}
> > > > >/**
> > > > Are we sure this does not warrant a level higher than dbg, such as
> > > > notice/warn?
> > > We might make it a warn, but this doesn't change much the economy
> > > of the driver as we will keep doing what we were supposed to do.
> > > 
> > > > Because how can we be sure the two entities will not stomp on
> > > > each other toes if we failed to obtain lock?
> > > So far, in all the research I've done, no one looks like using
> > > MCR lock, but yet someone is stuck in it.
> > 
> > If someone has the lock then that someone thinks they are using it. Just
> > because you can't see what someone piece of IFWI is doing doesn't mean it
> > isn't doing it. And if it is a genuinely missing unlock then it needs to be
> > tracked down and fixed with an IFWI update otherwise the system is going to
> > be unstable from that point on.
> 
> But I'm not changing here the behavior of the driver. The driver
> will keep doing what was doing before.
> 
> Because this most probably won't be noticed by the user, then I
> don't see why it should shout out loud that the system is
> unusable when most probably it is.

That's like saying that any random race condition isn't likely to be
noticed by the user so it's not a big deal if we're missing a few
mutexes or spinlocks somewhere...even though there's likely to be no
user-visible impact to any race condition 99% of the time, it's the 1%
that winds up being absolutely catastrophic.

If we're not obtaining the MCR lock as expected and are simply moving
forward to force our own steering (possibly at the same time firmware is
programming steering to a different value), you probably won't actually
see a problem either because the operations won't wind up interleaving
in a problematic order, or because the driver and the firmware both
happen to be trying to steer to the same instance (e.g., instance #0 is
a quite common target).  But even if they're hard to hit, the
possibility for a major problem is still there and basically we need to
consider the whole platform to be in an unknown, unstable state once
we've detected one of these issues.

Based on some earlier experiments, it sounds like the problem at the
moment is that we've just been too hasty with deciding that the lock is
"stuck."  There's no formal guidance on what an appropriate timeout is,
but Jonathan's patch to increase the timeout by 10x (from 100ms to 1s)
should hopefully take care of those cases and prevent us from causing
any unprotected races.  If we have any other problems where the lock is
truly stuck (as was seen after an S3 resume), that's a critical error
that needs to be escalated to whichever component is the culprit --- any
such system simply cannot be used safely.  Even if the KMD didn't notice
such stuck semaphores itself, one misbehaving firmware agent could still
block other firmware agents and cause major problems for the health of
the system.

> 
> BTW, at my understanding this is not an IFWI problem. We checked
> with different version of IFWI 

Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread John Harrison

On 10/4/2023 12:35, Andi Shyti wrote:

Hi John,


The MCR steering semaphore is a shared lock entry between i915
and various firmware components.

Getting the lock might sinchronize on some shared resources.
Sometimes though, it might happen that the firmware forgets to
unlock causing unnecessary noise in the driver which keeps doing
what was supposed to do, ignoring the problem.

Do not consider this failure as an error, but just print a debug
message stating that the MCR locking has been skipped.

On the driver side we still have spinlocks that make sure that
the access to the resources is serialized.

Signed-off-by: Andi Shyti 
Cc: Jonathan Cavitt 
Cc: Matt Roper 
Cc: Nirmoy Das 
---
drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index 326c2ed1d99b..51eb693df39b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long 
*flags)
 * would indicate some hardware/firmware is misbehaving and not
 * releasing it properly.
 */
-   if (err == -ETIMEDOUT) {
-   gt_err_ratelimited(gt, "hardware MCR steering semaphore timed 
out");
-   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
unreliable */
-   }
+   if (err == -ETIMEDOUT)
+   gt_dbg(gt, "hardware MCR steering semaphore timed out");
}
/**

Are we sure this does not warrant a level higher than dbg, such as
notice/warn?

We might make it a warn, but this doesn't change much the economy
of the driver as we will keep doing what we were supposed to do.


Because how can we be sure the two entities will not stomp on
each other toes if we failed to obtain lock?

So far, in all the research I've done, no one looks like using
MCR lock, but yet someone is stuck in it.

If someone has the lock then that someone thinks they are using it. Just
because you can't see what someone piece of IFWI is doing doesn't mean it
isn't doing it. And if it is a genuinely missing unlock then it needs to be
tracked down and fixed with an IFWI update otherwise the system is going to
be unstable from that point on.

But I'm not changing here the behavior of the driver. The driver
will keep doing what was doing before.
And what it is doing is dangerous and can lead to unpredictable results 
because a critical section resource access is no longer wrapped with a 
critical section lock. Hence there is an error message to say Here Be 
Dragons.





Because this most probably won't be noticed by the user, then I
don't see why it should shout out loud that the system is
unusable when most probably it is.

Just because a race condition is hard to hit doesn't mean it won't be hit.

The point of shouting out loud is that we know for a fact a problem has 
occurred. That problem might lead to nothing or it might lead to subtle 
data corruption, gross crashes or anything in between.




BTW, at my understanding this is not an IFWI problem. We checked
with different version of IFWI and the issue is the same.
Which implies it is a driver bug after all? In which case you absolutely 
should not be papering over it in the driver.




Besides we received reports also from systems that are not using
IFWI at all.
There is no system that does not use IFWI. Integrated or discrete, all 
systems have an IFWI. It's just part of the main BIOS on an integrated 
platform.


John.




(How can we be sure about
"forgot to unlock" vs "in prolonged active use"?

There is a patch from Jonathan that is testing a different
timeout.


Or if we can be sure, can
we force unlock and take the lock for the driver explicitly?)

I sent a patch with this solution and Matt turned it down.

Presumably because both forcing a lock and ignoring a failed lock are Bad
Things to be doing!
Just because some entity out of our control isn't playing friendly doesn't
mean we can ignore the problem. The lock is there for a reason. If someone
else owns the lock then any steered access by i915 is potentially going to
be routed to the wrong register as the other entity re-directs the steering
behind out back. That is why there is an error message being printed.
Because things are quite possibly going to fail in some unknown manner.

Yes, agree. This has been already discussed here[*] where I sent
such RFC for the sole purpose of receiving some opinions and
check how CI would behave.

BTW, we are already doing this during the GT resume[**]... it's a
bit of a different context, but it still forces the release of
the lock.

This patch, anyway, is not doing this.

Thanks a lot,
Andi

[*] https://patchwork.freedesktop.org/series/124402/
[**] 37280ef5c1c4 ("drm/i915: Clean steer semaphore on resume")




Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread Andi Shyti
Hi John,

> > > > The MCR steering semaphore is a shared lock entry between i915
> > > > and various firmware components.
> > > > 
> > > > Getting the lock might sinchronize on some shared resources.
> > > > Sometimes though, it might happen that the firmware forgets to
> > > > unlock causing unnecessary noise in the driver which keeps doing
> > > > what was supposed to do, ignoring the problem.
> > > > 
> > > > Do not consider this failure as an error, but just print a debug
> > > > message stating that the MCR locking has been skipped.
> > > > 
> > > > On the driver side we still have spinlocks that make sure that
> > > > the access to the resources is serialized.
> > > > 
> > > > Signed-off-by: Andi Shyti 
> > > > Cc: Jonathan Cavitt 
> > > > Cc: Matt Roper 
> > > > Cc: Nirmoy Das 
> > > > ---
> > > >drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
> > > >1 file changed, 2 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
> > > > b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > index 326c2ed1d99b..51eb693df39b 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > > > @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, 
> > > > unsigned long *flags)
> > > >  * would indicate some hardware/firmware is misbehaving and not
> > > >  * releasing it properly.
> > > >  */
> > > > -   if (err == -ETIMEDOUT) {
> > > > -   gt_err_ratelimited(gt, "hardware MCR steering semaphore 
> > > > timed out");
> > > > -   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
> > > > unreliable */
> > > > -   }
> > > > +   if (err == -ETIMEDOUT)
> > > > +   gt_dbg(gt, "hardware MCR steering semaphore timed out");
> > > >}
> > > >/**
> > > Are we sure this does not warrant a level higher than dbg, such as
> > > notice/warn?
> > We might make it a warn, but this doesn't change much the economy
> > of the driver as we will keep doing what we were supposed to do.
> > 
> > > Because how can we be sure the two entities will not stomp on
> > > each other toes if we failed to obtain lock?
> > So far, in all the research I've done, no one looks like using
> > MCR lock, but yet someone is stuck in it.
> 
> If someone has the lock then that someone thinks they are using it. Just
> because you can't see what someone piece of IFWI is doing doesn't mean it
> isn't doing it. And if it is a genuinely missing unlock then it needs to be
> tracked down and fixed with an IFWI update otherwise the system is going to
> be unstable from that point on.

But I'm not changing here the behavior of the driver. The driver
will keep doing what was doing before.

Because this most probably won't be noticed by the user, then I
don't see why it should shout out loud that the system is
unusable when most probably it is.

BTW, at my understanding this is not an IFWI problem. We checked
with different version of IFWI and the issue is the same.

Besides we received reports also from systems that are not using
IFWI at all.

> 
> > 
> > > (How can we be sure about
> > > "forgot to unlock" vs "in prolonged active use"?
> > There is a patch from Jonathan that is testing a different
> > timeout.
> > 
> > > Or if we can be sure, can
> > > we force unlock and take the lock for the driver explicitly?)
> > I sent a patch with this solution and Matt turned it down.
> 
> Presumably because both forcing a lock and ignoring a failed lock are Bad
> Things to be doing!
> Just because some entity out of our control isn't playing friendly doesn't
> mean we can ignore the problem. The lock is there for a reason. If someone
> else owns the lock then any steered access by i915 is potentially going to
> be routed to the wrong register as the other entity re-directs the steering
> behind out back. That is why there is an error message being printed.
> Because things are quite possibly going to fail in some unknown manner.

Yes, agree. This has been already discussed here[*] where I sent
such RFC for the sole purpose of receiving some opinions and
check how CI would behave.

BTW, we are already doing this during the GT resume[**]... it's a
bit of a different context, but it still forces the release of
the lock.

This patch, anyway, is not doing this.

Thanks a lot,
Andi

[*] https://patchwork.freedesktop.org/series/124402/
[**] 37280ef5c1c4 ("drm/i915: Clean steer semaphore on resume")


Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread John Harrison

On 10/4/2023 07:08, Andi Shyti wrote:

Hi Tvrtko,


The MCR steering semaphore is a shared lock entry between i915
and various firmware components.

Getting the lock might sinchronize on some shared resources.
Sometimes though, it might happen that the firmware forgets to
unlock causing unnecessary noise in the driver which keeps doing
what was supposed to do, ignoring the problem.

Do not consider this failure as an error, but just print a debug
message stating that the MCR locking has been skipped.

On the driver side we still have spinlocks that make sure that
the access to the resources is serialized.

Signed-off-by: Andi Shyti 
Cc: Jonathan Cavitt 
Cc: Matt Roper 
Cc: Nirmoy Das 
---
   drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
   1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index 326c2ed1d99b..51eb693df39b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long 
*flags)
 * would indicate some hardware/firmware is misbehaving and not
 * releasing it properly.
 */
-   if (err == -ETIMEDOUT) {
-   gt_err_ratelimited(gt, "hardware MCR steering semaphore timed 
out");
-   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
unreliable */
-   }
+   if (err == -ETIMEDOUT)
+   gt_dbg(gt, "hardware MCR steering semaphore timed out");
   }
   /**

Are we sure this does not warrant a level higher than dbg, such as
notice/warn?

We might make it a warn, but this doesn't change much the economy
of the driver as we will keep doing what we were supposed to do.


Because how can we be sure the two entities will not stomp on
each other toes if we failed to obtain lock?

So far, in all the research I've done, no one looks like using
MCR lock, but yet someone is stuck in it.
If someone has the lock then that someone thinks they are using it. Just 
because you can't see what someone piece of IFWI is doing doesn't mean 
it isn't doing it. And if it is a genuinely missing unlock then it needs 
to be tracked down and fixed with an IFWI update otherwise the system is 
going to be unstable from that point on.





(How can we be sure about
"forgot to unlock" vs "in prolonged active use"?

There is a patch from Jonathan that is testing a different
timeout.


Or if we can be sure, can
we force unlock and take the lock for the driver explicitly?)

I sent a patch with this solution and Matt turned it down.
Presumably because both forcing a lock and ignoring a failed lock are 
Bad Things to be doing!


Just because some entity out of our control isn't playing friendly 
doesn't mean we can ignore the problem. The lock is there for a reason. 
If someone else owns the lock then any steered access by i915 is 
potentially going to be routed to the wrong register as the other entity 
re-directs the steering behind out back. That is why there is an error 
message being printed. Because things are quite possibly going to fail 
in some unknown manner.


John.



Andi




Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread Andi Shyti
Hi Tvrtko,

> > The MCR steering semaphore is a shared lock entry between i915
> > and various firmware components.
> > 
> > Getting the lock might sinchronize on some shared resources.
> > Sometimes though, it might happen that the firmware forgets to
> > unlock causing unnecessary noise in the driver which keeps doing
> > what was supposed to do, ignoring the problem.
> > 
> > Do not consider this failure as an error, but just print a debug
> > message stating that the MCR locking has been skipped.
> > 
> > On the driver side we still have spinlocks that make sure that
> > the access to the resources is serialized.
> > 
> > Signed-off-by: Andi Shyti 
> > Cc: Jonathan Cavitt 
> > Cc: Matt Roper 
> > Cc: Nirmoy Das 
> > ---
> >   drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
> >   1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
> > b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > index 326c2ed1d99b..51eb693df39b 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> > @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned 
> > long *flags)
> >  * would indicate some hardware/firmware is misbehaving and not
> >  * releasing it properly.
> >  */
> > -   if (err == -ETIMEDOUT) {
> > -   gt_err_ratelimited(gt, "hardware MCR steering semaphore timed 
> > out");
> > -   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
> > unreliable */
> > -   }
> > +   if (err == -ETIMEDOUT)
> > +   gt_dbg(gt, "hardware MCR steering semaphore timed out");
> >   }
> >   /**
> 
> Are we sure this does not warrant a level higher than dbg, such as
> notice/warn?

We might make it a warn, but this doesn't change much the economy
of the driver as we will keep doing what we were supposed to do.

> Because how can we be sure the two entities will not stomp on
> each other toes if we failed to obtain lock?

So far, in all the research I've done, no one looks like using
MCR lock, but yet someone is stuck in it.

> (How can we be sure about
> "forgot to unlock" vs "in prolonged active use"?

There is a patch from Jonathan that is testing a different
timeout.

> Or if we can be sure, can
> we force unlock and take the lock for the driver explicitly?)

I sent a patch with this solution and Matt turned it down.

Andi


Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread Tvrtko Ursulin



On 04/10/2023 10:43, Andi Shyti wrote:

The MCR steering semaphore is a shared lock entry between i915
and various firmware components.

Getting the lock might sinchronize on some shared resources.
Sometimes though, it might happen that the firmware forgets to
unlock causing unnecessary noise in the driver which keeps doing
what was supposed to do, ignoring the problem.

Do not consider this failure as an error, but just print a debug
message stating that the MCR locking has been skipped.

On the driver side we still have spinlocks that make sure that
the access to the resources is serialized.

Signed-off-by: Andi Shyti 
Cc: Jonathan Cavitt 
Cc: Matt Roper 
Cc: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index 326c2ed1d99b..51eb693df39b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long 
*flags)
 * would indicate some hardware/firmware is misbehaving and not
 * releasing it properly.
 */
-   if (err == -ETIMEDOUT) {
-   gt_err_ratelimited(gt, "hardware MCR steering semaphore timed 
out");
-   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
unreliable */
-   }
+   if (err == -ETIMEDOUT)
+   gt_dbg(gt, "hardware MCR steering semaphore timed out");
  }
  
  /**


Are we sure this does not warrant a level higher than dbg, such as 
notice/warn? Because how can we be sure the two entities will not stomp 
on each other toes if we failed to obtain lock? (How can we be sure 
about "forgot to unlock" vs "in prolonged active use"? Or if we can be 
sure, can we force unlock and take the lock for the driver explicitly?)


Regards,

Tvrtko


[Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

2023-10-04 Thread Andi Shyti
The MCR steering semaphore is a shared lock entry between i915
and various firmware components.

Getting the lock might sinchronize on some shared resources.
Sometimes though, it might happen that the firmware forgets to
unlock causing unnecessary noise in the driver which keeps doing
what was supposed to do, ignoring the problem.

Do not consider this failure as an error, but just print a debug
message stating that the MCR locking has been skipped.

On the driver side we still have spinlocks that make sure that
the access to the resources is serialized.

Signed-off-by: Andi Shyti 
Cc: Jonathan Cavitt 
Cc: Matt Roper 
Cc: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index 326c2ed1d99b..51eb693df39b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long 
*flags)
 * would indicate some hardware/firmware is misbehaving and not
 * releasing it properly.
 */
-   if (err == -ETIMEDOUT) {
-   gt_err_ratelimited(gt, "hardware MCR steering semaphore timed 
out");
-   add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now 
unreliable */
-   }
+   if (err == -ETIMEDOUT)
+   gt_dbg(gt, "hardware MCR steering semaphore timed out");
 }
 
 /**
-- 
2.40.1