Re: [PATCH v9 1/9] drm/panic: Add drm panic locking

2024-03-14 Thread Daniel Vetter
On Fri, Mar 08, 2024 at 02:45:52PM +0100, Thomas Zimmermann wrote:
> Hi
> 
> Am 07.03.24 um 10:14 schrieb Jocelyn Falempe:
> > From: Daniel Vetter 
> > 
> > Rough sketch for the locking of drm panic printing code. The upshot of
> > this approach is that we can pretty much entirely rely on the atomic
> > commit flow, with the pair of raw_spin_lock/unlock providing any
> > barriers we need, without having to create really big critical
> > sections in code.
> 
> The ast driver has a lock to protect modesetting and concurrent EDID reads
> from each other. [1] That new panic_lock seems to serve the same purpose.
> 
> If we go that route, can we make this a bit more generic and call it
> commit_lock? I could then remove the dedicated lock from ast.

No, because the drm_panic_lock/unlock sections must be as small as
possible, for two reasons:

- Anything we do while holding this lock that isn't strictly needed for
  the panic code (like EDID reading, or touching more than the scanout
  address registers) will reduce the chances that the drm panic handler
  can run, since all we can do is try_lock.

- It's a raw spinlock: if you do more than a handful of instructions
  you'll really annoy the -rt people, because raw spinlocks are not
  converted to sleeping locks with the realtime config enabled. Reading
  an EDID (which takes upwards of tens of ms) is at least 4-5 orders of
  magnitude too much.

The mutex is really the right lock for protecting modesets against EDID
reads; the panic lock covers, at most, very small and specific things on
top of that.
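
To make that concrete, here is a driver-side sketch (hypothetical code,
not ast's actual driver; it assumes only the drm_panic_lock/unlock
helpers this patch introduces):

	/* Hypothetical driver sketch: slow work stays under the driver's
	 * mutex; only the scanout-address write shares the raw spinlock
	 * with the panic handler, so the handler's try_lock almost always
	 * succeeds.
	 */
	static void example_commit_plane(struct example_device *edev,
					 struct drm_plane_state *new_state)
	{
		mutex_lock(&edev->modeset_lock);	/* modesets vs. EDID reads */
		example_program_mode(edev, new_state);	/* slow: PLLs, EDID, ... */

		drm_panic_lock(edev->drm);		/* a handful of instructions */
		example_write_scanout_address(edev, new_state->fb);
		drm_panic_unlock(edev->drm);

		mutex_unlock(&edev->modeset_lock);
	}
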
-Sima

> 
> Best regards
> Thomas
> 
> [1] https://elixir.bootlin.com/linux/v6.7/source/drivers/gpu/drm/ast/ast_drv.h#L195
> 
> > 
> > This also avoids the need for drivers to explicitly update the
> > panic handler state, which they might forget to do, or not do
> > consistently, and then we blow up at the worst possible time.
> > 
> > It is somewhat racy against a concurrent atomic update, and we might
> > write into a buffer which the hardware will never display. But there's
> > fundamentally no way to avoid that - if we do the panic state update
> > explicitly after writing to the hardware, we might instead write to an
> > old buffer that the user will barely ever see.
> > 
> > Note that an rcu-protected dereference of plane->state would give us
> > the same guarantees, but it has the downside that we then need to
> > protect the plane state freeing functions with call_rcu too, which
> > would impact a lot of code very widely and therefore doesn't seem
> > worth the complexity compared to a raw spinlock with very tiny
> > critical sections. Plus rcu cannot be used to protect access to
> > peek/poke registers anyway, so we'd still need the lock for those cases.
> > 
> > Peek/poke registers for vram access (or a gart pte reserved just for
> > panic code) are also the reason I've gone with a per-device and not
> > per-plane spinlock, since usually these things are global for the
> > entire display. Going with per-plane locks would mean drivers for such
> > hardware would need additional locks, which we don't want, since it
> > deviates from the per-console takeover-locks design.
> > 
> > Longer term it might be useful if the panic notifiers grow a bit more
> > structure than just the absolute bare
> > EXPORT_SYMBOL(panic_notifier_list) - somewhat aside, why is that not
> > EXPORT_SYMBOL_GPL ... If panic notifiers were more like console
> > drivers, with proper register/unregister interfaces, we could perhaps
> > reuse the very fancy console lock with all its check and takeover
> > semantics that John Ogness is developing to fix the console_lock mess.
> > But for the initial cut of drm panic printing support I don't think
> > we need that, because the critical sections are extremely small and
> > only happen once per display refresh. So generally just 60 tiny locked
> > sections per second, which is nothing compared to a serial console
> > running at 115 kbaud (roughly ten thousand bytes per second, each one
> > a slow mmio write). So for now the raw spin trylock in the drm panic
> > notifier callback should be good enough.
> > 
> > Another benefit of making panic notifiers more like full-blown
> > consoles (that are used in panics only) would be that we get the
> > two-stage design, where first all the safe outputs are used, and then
> > the dangerous takeover tricks are deployed (where for display drivers
> > we also might try to intercept any in-flight display buffer flips,
> > which, if we race and misprogram fifos and watermarks, can hang the
> > memory controller on some hw).
> > 
> > For context the actual implementation on the drm side is by Jocelyn
> > and this patch is meant to be combined with the overall approach in
> > v7 (v8 is a bit less flexible, which I think is the wrong direction):
> > 
> > https://lore.kernel.org/dri-devel/20240104160301.185915-1-jfale...@redhat.com/
> > 
> > Note that the locking is very much not correct there, hence this
> > separate rfc.

Re: [PATCH v9 1/9] drm/panic: Add drm panic locking

2024-03-08 Thread Thomas Zimmermann

Hi

Am 07.03.24 um 10:14 schrieb Jocelyn Falempe:

From: Daniel Vetter 

Rough sketch for the locking of drm panic printing code. The upshot of
this approach is that we can pretty much entirely rely on the atomic
commit flow, with the pair of raw_spin_lock/unlock providing any
barriers we need, without having to create really big critical
sections in code.


The ast driver has a lock to protect modesetting and concurrent EDID 
reads from each other. [1] That new panic_lock seems to serve the same 
purpose.


If we go that route, can we make this a bit more generic and call it 
commit_lock? I could then remove the dedicated lock from ast.


Best regards
Thomas

[1] https://elixir.bootlin.com/linux/v6.7/source/drivers/gpu/drm/ast/ast_drv.h#L195




This also avoids the need for drivers to explicitly update the
panic handler state, which they might forget to do, or not do
consistently, and then we blow up at the worst possible time.

It is somewhat racy against a concurrent atomic update, and we might
write into a buffer which the hardware will never display. But there's
fundamentally no way to avoid that - if we do the panic state update
explicitly after writing to the hardware, we might instead write to an
old buffer that the user will barely ever see.

Note that an rcu-protected dereference of plane->state would give us
the same guarantees, but it has the downside that we then need to
protect the plane state freeing functions with call_rcu too, which
would impact a lot of code very widely and therefore doesn't seem
worth the complexity compared to a raw spinlock with very tiny
critical sections. Plus rcu cannot be used to protect access to
peek/poke registers anyway, so we'd still need the lock for those cases.
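
For illustration, the rejected RCU variant would have looked roughly
like this (a sketch only: plane->state is not actually __rcu-annotated,
and the helper names and rcu_head member are made up):

	/* Hypothetical panic-side reader under the rejected RCU scheme. */
	rcu_read_lock();
	plane_state = rcu_dereference(plane->state);
	if (plane_state && plane_state->fb)
		draw_panic_screen_on(plane_state);	/* made-up helper */
	rcu_read_unlock();

	/* ...and, correspondingly, every plane-state free would then have
	 * to be deferred through call_rcu:
	 */
	call_rcu(&old_state->rcu_head, drm_plane_state_free_rcu);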

Peek/poke registers for vram access (or a gart pte reserved just for
panic code) are also the reason I've gone with a per-device and not
per-plane spinlock, since usually these things are global for the
entire display. Going with per-plane locks would mean drivers for such
hardware would need additional locks, which we don't want, since it
deviates from the per-console takeover-locks design.

Longer term it might be useful if the panic notifiers grow a bit more
structure than just the absolute bare
EXPORT_SYMBOL(panic_notifier_list) - somewhat aside, why is that not
EXPORT_SYMBOL_GPL ... If panic notifiers were more like console
drivers, with proper register/unregister interfaces, we could perhaps
reuse the very fancy console lock with all its check and takeover
semantics that John Ogness is developing to fix the console_lock mess.
But for the initial cut of drm panic printing support I don't think
we need that, because the critical sections are extremely small and
only happen once per display refresh. So generally just 60 tiny locked
sections per second, which is nothing compared to a serial console
running at 115 kbaud (roughly ten thousand bytes per second, each one
a slow mmio write). So for now the raw spin trylock in the drm panic
notifier callback should be good enough.

Another benefit of making panic notifiers more like full-blown
consoles (that are used in panics only) would be that we get the
two-stage design, where first all the safe outputs are used, and then
the dangerous takeover tricks are deployed (where for display drivers
we also might try to intercept any in-flight display buffer flips,
which, if we race and misprogram fifos and watermarks, can hang the
memory controller on some hw).

For context the actual implementation on the drm side is by Jocelyn
and this patch is meant to be combined with the overall approach in
v7 (v8 is a bit less flexible, which I think is the wrong direction):

https://lore.kernel.org/dri-devel/20240104160301.185915-1-jfale...@redhat.com/

Note that the locking is very much not correct there, hence this
separate rfc.

v2:
- fix authorship, this was all my typing
- some typo oopsies
- link to the drm panic work by Jocelyn for context

Signed-off-by: Daniel Vetter 
Cc: Jocelyn Falempe 
Cc: Andrew Morton 
Cc: "Peter Zijlstra (Intel)" 
Cc: Lukas Wunner 
Cc: Petr Mladek 
Cc: Steven Rostedt 
Cc: John Ogness 
Cc: Sergey Senozhatsky 
Cc: Maarten Lankhorst 
Cc: Maxime Ripard 
Cc: Thomas Zimmermann 
Cc: David Airlie 
Cc: Daniel Vetter 
---
  drivers/gpu/drm/drm_atomic_helper.c |  3 +
  drivers/gpu/drm/drm_drv.c   |  1 +
  include/drm/drm_mode_config.h   | 10 +++
  include/drm/drm_panic.h | 99 +
  4 files changed, 113 insertions(+)
  create mode 100644 include/drm/drm_panic.h
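
(The new header itself is not shown in this excerpt; going by the commit
message, the helpers are presumably thin wrappers around a raw spinlock
added to drm_mode_config, along these lines - the panic_lock field name
is an assumption, not the literal patch:)

	static inline void drm_panic_lock(struct drm_device *dev)
	{
		raw_spin_lock(&dev->mode_config.panic_lock);
	}

	static inline void drm_panic_unlock(struct drm_device *dev)
	{
		raw_spin_unlock(&dev->mode_config.panic_lock);
	}

	/* The panic handler itself would only ever use the trylock form. */
	static inline bool drm_panic_trylock(struct drm_device *dev)
	{
		return raw_spin_trylock(&dev->mode_config.panic_lock);
	}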

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 39ef0a6addeb..c0bb91312fb2 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -38,6 +38,7 @@
  #include 
  #include 
  #include 
+#include <drm/drm_panic.h>
  #include 
  #include 
  #include 
@@ -3099,6 +3100,7 @@ int drm_atomic_helper_swap_state(struct drm_atomic_state *state,
}
}
  
+	drm_panic_lock(state->dev);

[...]

Re: [PATCH v9 1/9] drm/panic: Add drm panic locking

2024-03-08 Thread Jocelyn Falempe

On 07/03/2024 11:27, John Ogness wrote:

On 2024-03-07, Jocelyn Falempe  wrote:

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 39ef0a6addeb..c0bb91312fb2 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -38,6 +38,7 @@
  #include 
  #include 
  #include 
+#include <drm/drm_panic.h>
  #include 
  #include 
  #include 
@@ -3099,6 +3100,7 @@ int drm_atomic_helper_swap_state(struct drm_atomic_state *state,
}
}
  
+	drm_panic_lock(state->dev);

for_each_oldnew_plane_in_state(state, plane, old_plane_state, new_plane_state, i) {
WARN_ON(plane->state != old_plane_state);
  
@@ -3108,6 +3110,7 @@ int drm_atomic_helper_swap_state(struct drm_atomic_state *state,

state->planes[i].state = old_plane_state;
plane->state = new_plane_state;
}
+	drm_panic_unlock(state->dev);


Is there a reason irqsave/irqrestore variants are not used? Maybe this
code path is too hot?


This lock will be taken for each page flip, so typically at 60Hz (or
maybe 144Hz for gamers). I don't know what the performance impact of
the irqsave/irqrestore variants would be.


By leaving interrupts enabled, there is the risk that a panic from
within any interrupt handler may block the drm panic handler.


The current design is that the panic handler will just use try_lock(),
and if it can't take the lock, the panic screen will not be shown.
The goal is to make sure drm_panic won't crash the machine or prevent
kdump or other panic handlers from running. So there is a very small
race window where the panic screen won't be seen, but that's OK.


So I think in this case the drm panic handler can't get blocked, as it
only ever uses try_lock().
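
In other words, the handler is shaped roughly like this (a simplified
sketch; the notifier wiring and the draw_panic_screen() helper are
illustrative, not the exact v9 code):

	static int drm_panic_notify(struct notifier_block *nb,
				    unsigned long action, void *data)
	{
		/* panic_nb is an assumed member name, for illustration */
		struct drm_device *dev = container_of(nb, struct drm_device,
						      panic_nb);

		/* Never spin in panic context: if the lock is contended,
		 * skip drawing the panic screen rather than risk a
		 * deadlock. */
		if (!drm_panic_trylock(dev))
			return NOTIFY_DONE;

		draw_panic_screen(dev);
		drm_panic_unlock(dev);
		return NOTIFY_DONE;
	}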


Best regards,

--

Jocelyn

Re: [PATCH v9 1/9] drm/panic: Add drm panic locking

2024-03-07 Thread John Ogness
On 2024-03-07, Jocelyn Falempe  wrote:
> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> index 39ef0a6addeb..c0bb91312fb2 100644
> --- a/drivers/gpu/drm/drm_atomic_helper.c
> +++ b/drivers/gpu/drm/drm_atomic_helper.c
> @@ -38,6 +38,7 @@
>  #include 
>  #include 
>  #include 
> +#include <drm/drm_panic.h>
>  #include 
>  #include 
>  #include 
> @@ -3099,6 +3100,7 @@ int drm_atomic_helper_swap_state(struct drm_atomic_state *state,
>   }
>   }
>  
> + drm_panic_lock(state->dev);
> 	for_each_oldnew_plane_in_state(state, plane, old_plane_state, new_plane_state, i) {
>   WARN_ON(plane->state != old_plane_state);
>  
> @@ -3108,6 +3110,7 @@ int drm_atomic_helper_swap_state(struct drm_atomic_state *state,
>   state->planes[i].state = old_plane_state;
>   plane->state = new_plane_state;
>   }
> + drm_panic_unlock(state->dev);

Is there a reason irqsave/irqrestore variants are not used? Maybe this
code path is too hot?

By leaving interrupts enabled, there is the risk that a panic from
within any interrupt handler may block the drm panic handler.
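
(For comparison, the irqsave form of that section would look something
like this; it assumes the lock lives in drm_mode_config, which this
excerpt doesn't show:)

	unsigned long flags;

	raw_spin_lock_irqsave(&state->dev->mode_config.panic_lock, flags);
	/* ... swap the plane/crtc/connector state pointers ... */
	raw_spin_unlock_irqrestore(&state->dev->mode_config.panic_lock, flags);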

John Ogness