Re: [PATCH] drm/bridge: aux-hpd-bridge: correct devm_drm_dp_hpd_bridge_add() stub

2024-05-11 Thread Johan Hovold
On Sat, May 11, 2024 at 11:54:30AM +0300, Dmitry Baryshkov wrote:
> If CONFIG_DRM_AUX_HPD_BRIDGE is not enabled, the aux-bridge.h header
> provides a stub for the bridge's functions. Correct the arguments list
> of one of those stubs to match the argument list of the non-stubbed
> function.
> 
> Fixes: e5ca263508f7 ("drm/bridge: aux-hpd: separate allocation and 
> registration")
> Reported-by: kernel test robot 
> Closes: 
> https://lore.kernel.org/oe-kbuild-all/202405110428.tmcfb1ut-...@intel.com/
> Cc: Johan Hovold 
> Signed-off-by: Dmitry Baryshkov 

Thanks for fixing this.

This one should also be backported (e.g. as the patch that can trigger
this probably also should be backported to 6.8 [1]):

Cc: sta...@vger.kernel.org  # 6.8
Reviewed-by: Johan Hovold 

Johan

[1] 
https://lore.kernel.org/r/20240424-qc-pmic-typec-hpd-split-v4-1-f7e10d147...@linaro.org
 


Re: [PATCH v3 1/2] drm/msm/dp: Add support for determining the eDP/DP mode from DT

2024-03-22 Thread Johan Hovold
On Fri, Mar 22, 2024 at 04:15:23PM +0200, Abel Vesa wrote:
> On 24-03-22 15:38:03, Dmitry Baryshkov wrote:
> > On Fri, 22 Mar 2024 at 15:36, Abel Vesa  wrote:
> > > On 24-03-22 15:30:54, Dmitry Baryshkov wrote:
> > > > On Fri, 22 Mar 2024 at 15:22, Abel Vesa  wrote:

> > > > > +static int dp_display_get_connector_type(struct platform_device 
> > > > > *pdev,
> > > > > +const struct msm_dp_desc 
> > > > > *desc)
> > > > > +{
> > > > > +   struct device *dev = >dev;
> > > > > +   struct device_node *aux_bus;
> > > > > +   struct device_node *panel;
> > > > > +   int ret = DRM_MODE_CONNECTOR_DisplayPort;
> > > > > +
> > > > > +   /* legacy platforms specify connector type in match data */
> > > > > +   if (desc->connector_type == DRM_MODE_CONNECTOR_eDP ||
> > > > > +   desc->connector_type == 
> > > > > DRM_MODE_CONNECTOR_DisplayPort)
> > > >
> > > > misaligned
> > > >
> > >
> > > Sure, will fix.
> > >
> > > > > +   return desc->connector_type;
> > > >
> > > > Can we drop this part completely?
> > > >
> > >
> > > You mean the whole if clause? How should we handle the legacy approach
> > > then?
> > 
> > Legacy platforms still have the aux-bus/panel. so they should be
> > handled by the check below.
> > 
> 
> Oh, in that case we can drop the connector_type from every desc for all
> platforms.

Guys, please trim your replies!

Johan


Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread

2024-03-21 Thread Johan Hovold
On Mon, Mar 18, 2024 at 11:01:25AM -0700, Abhinav Kumar wrote:
> On 3/15/2024 8:57 AM, Johan Hovold wrote:
> > On Thu, Mar 14, 2024 at 09:30:57AM -0700, Abhinav Kumar wrote:

> >> The race condition is between the time we get disconnect event and set
> >> link_ready to false, a commit can come in. Because setting link_ready to
> >> false happens in the event thread so it could be slightly delayed.
> > 
> > I get this part, just not why, or rather when, that becomes a problem.
> > 
> > Once the disconnect event is processed, dp_hpd_unplug_handle() will
> > update the state to ST_DISCONNECT_PENDING, and queue a notification
> > event. link_ready is (before this patch) still set to 1.

> This is the case I am thinking of:
> 
> 1) Disconnect event happens which will call dp_hpd_unplug_handle() but 
> link_ready is not false yet.
> 
> 2) There is a commit with a modeset, which shall trigger 
> atomic_disable() followed by an atomic_enable()
> 
> atomic_disable() will go through disable clocks and set hpd_state to 
> ST_DISCONNECTED.
> 
> 3) atomic_enable() will not go through because we will bail out because 
> state was ST_DISCONNECTED.
> 
>  if (state != ST_DISPLAY_OFF && state != ST_MAINLINK_READY) {
>  mutex_unlock(_display->event_mutex);
>  return;
>  }
> 
> 4) Now, if there is another commit with a modeset, it will go and crash 
> at atomic_disable()

Right, that's what I described in the mail you replied to but that still
doesn't answer what triggers those mode sets.
 
> > Here a commit comes in; what exactly are you suggesting would trigger
> > that? And in such a way that it breaks the state machine?

> Like we have seen, the commit can either come directly from userspace as 
> one last frame (the original bug I had given the link to) or from the 
> __drm_fb_helper_restore_fbdev_mode_unlocked() which happened in 
> sc8280xp's case. This is totally independent of the hpd_thread() with no 
> mutual exclusion.

Right . Still not sure about the details about "that last frame" issue,
that you saw in the past, and if that's still an issue or not. You
claimed that you had fixed that, right?

> This commit() can come before the link_ready was set to false. If it had 
> come after link_ready was set to false, atomic_check() would have failed 
> and no issue would have been seen.
> 
> My change is making the link_ready false sooner in the disconnect case.

Yes, but again, and as you have confirmed, you're only papering over the
issue at such a mode set can still come in before you set link_state to
false.
 
> > One way this could cause trouble is if you end up with a call to
> > dp_bridge_atomic_post_disable() which updates the state to
> > ST_DISCONNECTED. (1)
> > 
> > This would then need to be followed by another call to
> > dp_bridge_atomic_enable() which bails out early with the link clock
> > disabled. (2) (And if link_ready were to be set to 0 sooner, the
> > likelihood of this is reduced.)
> > 
> > This in turn, would trigger a reset when dp_bridge_atomic_disable() is
> > later called.

> Yes, this is exactly what I have written above.

Thanks for confirming.

> > This is the kind of description of the race I expect to see in the
> > commit message, and I'm still not sure what would trigger the call to
> > dp_bridge_atomic_post_disable() and dp_bridge_atomic_enable() (i.e. (1)
> > and (2) above) and whether this is a real issue or not.
> 
> I have explained what triggers the disable/enable call below.
> 
> > Also note that the above scenario is quite different from the one I've
> > hit and described earlier.

> Why is that so? Eventually it will also translate to the same scenario. 
> I would like to understand why this is different. I think in your case, 
> probably we do not know what triggers the modeset, but its a minor 
> detail like I have written before.

The state transitions are different and the enable event comes in
before the bridge has been fully tore down unlike in the scenario we
outlined above.

And it's certainly not a minor detail, as in the sc8280xp VT case,
those spurious hotplug events that trigger the atomic_enable would not
have caused any trouble if it wasn't for the case that the bridge was
stuck in the ST_MAINLINK_READY state.

That explains why the hotplug notification revert in rc7 made a
difference on sc8280xp. 

You're talking about an entirely different and, as far as I can tell,
hypothetical scenario where are user executes a modeset while pulling
the plug. This is certainly not why we had a number of user suddenly
starting to hit this crash after they upgraded to 6.8-rc1.

And, just 

Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread

2024-03-15 Thread Johan Hovold
On Thu, Mar 14, 2024 at 09:30:57AM -0700, Abhinav Kumar wrote:
> On 3/14/2024 8:38 AM, Johan Hovold wrote:
> > On Wed, Mar 13, 2024 at 10:24:08AM -0700, Abhinav Kumar wrote:

> > Perhaps I'm missing something in the race that you are trying to
> > describe (and which I've asked you to describe in more detail so that I
> > don't have to spend more time trying to come up with a reproducer
> > myself).

> The race condition is between the time we get disconnect event and set 
> link_ready to false, a commit can come in. Because setting link_ready to 
> false happens in the event thread so it could be slightly delayed.

I get this part, just not why, or rather when, that becomes a problem.

Once the disconnect event is processed, dp_hpd_unplug_handle() will
update the state to ST_DISCONNECT_PENDING, and queue a notification
event. link_ready is (before this patch) still set to 1.

Here a commit comes in; what exactly are you suggesting would trigger
that? And in such a way that it breaks the state machine?

One way this could cause trouble is if you end up with a call to
dp_bridge_atomic_post_disable() which updates the state to
ST_DISCONNECTED. (1)

This would then need to be followed by another call to
dp_bridge_atomic_enable() which bails out early with the link clock
disabled. (2) (And if link_ready were to be set to 0 sooner, the
likelihood of this is reduced.)

This in turn, would trigger a reset when dp_bridge_atomic_disable() is
later called.

This is the kind of description of the race I expect to see in the
commit message, and I'm still not sure what would trigger the call to
dp_bridge_atomic_post_disable() and dp_bridge_atomic_enable() (i.e. (1)
and (2) above) and whether this is a real issue or not.

Also note that the above scenario is quite different from the one I've
hit and described earlier.

> It will be hard to reproduce this. Only way I can think of is to delay 
> the EV_NOTIFICATION for sometime and see in dp_bridge_hpd_notify()
> 
>  else if (dp_display->link_ready && status == 
> connector_status_disconnected)
>  dp_add_event(dp, EV_HPD_UNPLUG_INT, 0, 0);
> 
> as dp_add_event() will add the event, then wakeup the event_q.

Sure that would increase the race window with the current code, but that
alone isn't enough to trigger the bug AFAICT.

> Before the event thread wakes up and processes this unplug event, the 
> commit can come in. This is the race condition i was thinking of.

Yes, but what triggers the commit? And why would it lead to a mode set
that disables the bridge?
 
Johan


Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread

2024-03-14 Thread Johan Hovold
On Wed, Mar 13, 2024 at 10:24:08AM -0700, Abhinav Kumar wrote:
> On 3/13/2024 1:18 AM, Johan Hovold wrote:

> > Right, but your proposed fix would not actually fix anything and judging
> > from the sparse commit message and diff itself it is clearly only meant
> > to mitigate the case where user space is involved, which is *not* the
> > case here.

> There can be a race condition between the time the DP driver gets the 
> hpd disconnect event and when the hpd thread processes that event 
> allowing the commit to sneak in. This is something which has always been 
> there even without pm_runtime series and remains even today.
> 
> In this race condition, the setting of "link_ready" to false can be a 
> bit delayed if we go through the HPD event processing increasing the 
> race condition window.
> 
> If link_ready is false, atomic_check() fails, thereby failing any 
> commits and hence not allowing the atomic_disable() / atomic_enable() 
> cycle and hence avoiding this reset.
> 
> The patch is moving the setting of link_ready to false earlier by not 
> putting it through the HPD event thread and hence trying to reduce the 
> window of the issue.

Perhaps I'm missing something in the race that you are trying to
describe (and which I've asked you to describe in more detail so that I
don't have to spend more time trying to come up with a reproducer
myself).

I do understand how your patch works, but my point is that it does
not fix the race that we are hitting on sc8280xp and, unless I'm missing
something, it is not even sufficient to fix the race you are talking
about as user space can still trigger that ioctl() before you clear the
link_ready flag.

That's why I said that it is only papering over the issue by making the
race window smaller (and this should also be highlighted in the commit
message).

For some reason it also made things worse on sc8280xp, but I did not
spend time on tracking down exactly why.

Johan


[PATCH 0/2] drm/msm/dp: fix runtime PM leaks on hotplug

2024-03-13 Thread Johan Hovold
As I've reported elsewhere, I've been hitting runtime PM usage count
leaks when investigated a DisplayPort hotplug regression on the Lenovo
ThinkPad X13s. [1]

This series addresses two obvious leaks on disconnect and on connect
failures, respectively.

Johan


[1] https://lore.kernel.org/lkml/ze8ke_m2xhypy...@hovoldconsulting.com/


Johan Hovold (2):
  drm/msm/dp: fix runtime PM leak on disconnect
  drm/msm/dp: fix runtime PM leak on connect failure

 drivers/gpu/drm/msm/dp/dp_display.c | 2 ++
 1 file changed, 2 insertions(+)

-- 
2.43.2



[PATCH 2/2] drm/msm/dp: fix runtime PM leak on connect failure

2024-03-13 Thread Johan Hovold
Make sure to balance the runtime PM usage counter (and suspend) before
returning on connect failures (e.g. DPCD read failures after a spurious
connect event or if link training fails).

Fixes: 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime framework into DP 
driver")
Cc: sta...@vger.kernel.org  # 6.8
Cc: Kuogee Hsieh 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/dp/dp_display.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/msm/dp/dp_display.c 
b/drivers/gpu/drm/msm/dp/dp_display.c
index 8e8cf531da45..78464c395c3d 100644
--- a/drivers/gpu/drm/msm/dp/dp_display.c
+++ b/drivers/gpu/drm/msm/dp/dp_display.c
@@ -598,6 +598,7 @@ static int dp_hpd_plug_handle(struct dp_display_private 
*dp, u32 data)
ret = dp_display_usbpd_configure_cb(>dev);
if (ret) {  /* link train failed */
dp->hpd_state = ST_DISCONNECTED;
+   pm_runtime_put_sync(>dev);
} else {
dp->hpd_state = ST_MAINLINK_READY;
}
-- 
2.43.2



[PATCH 1/2] drm/msm/dp: fix runtime PM leak on disconnect

2024-03-13 Thread Johan Hovold
Make sure to put the runtime PM usage count (and suspend) also when
receiving a disconnect event while in the ST_MAINLINK_READY state.

This specifically avoids leaking a runtime PM usage count on every
disconnect with display servers that do not automatically enable
external displays when receiving a hotplug notification.

Fixes: 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime framework into DP 
driver")
Cc: sta...@vger.kernel.org  # 6.8
Cc: Kuogee Hsieh 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/dp/dp_display.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/msm/dp/dp_display.c 
b/drivers/gpu/drm/msm/dp/dp_display.c
index 4c72124ffb5d..8e8cf531da45 100644
--- a/drivers/gpu/drm/msm/dp/dp_display.c
+++ b/drivers/gpu/drm/msm/dp/dp_display.c
@@ -655,6 +655,7 @@ static int dp_hpd_unplug_handle(struct dp_display_private 
*dp, u32 data)
dp_display_host_phy_exit(dp);
dp->hpd_state = ST_DISCONNECTED;
dp_display_notify_disconnect(>dp_display.pdev->dev);
+   pm_runtime_put_sync(>dev);
mutex_unlock(>event_mutex);
return 0;
}
-- 
2.43.2



Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread

2024-03-13 Thread Johan Hovold
On Tue, Mar 12, 2024 at 10:39:46AM -0700, Abhinav Kumar wrote:
> On 3/12/2024 9:59 AM, Johan Hovold wrote:

> >> Heh. This is getting ridiculous. I just tried running with this patch
> >> and it again breaks hotplug detect in a VT console and in X (where I
> >> could enable a reconnected external display by running xrandr twice
> >> before).
> >>
> >> So, please, do not apply this one.
> > 
> > To make things worse, I indeed also hit the reset when disconnecting
> > after such a failed hotplug.

> Ack, I will hold off till I analyze your issues more which you have 
> listed in separate replies. Especially about the spurious connect, I 
> believe you are trying to mention that, by adding logs, you are able to 
> delay the processing of a connect event to *make* it like a spurious 
> one? In case, I got this part wrong, can you pls explain the spurious 
> connect scenario again?

No, I only mentioned the debug printks in passing as instrumentation
like that may affect race conditions (but I'm also hitting the resets
also with no printks in place).

The spurious connect event comes directly from the pmic firmware, and
even if we may optimise things by implementing some kind of debounce,
the hotplug implementation needs to be robust enough to not kill the
machine if such an event gets through.

Basically what I see is that during physical disconnect there can be
multiple hpd notify events (e.g. connect, disconnect, connect):

[  146.910195] usb 5-1: USB disconnect, device number 4
[  146.931026] msm-dp-display ae98000.displayport-controller: 
dp_bridge_hpd_notify - link_ready = 1, status = 2
[  146.934785] msm-dp-display ae98000.displayport-controller: 
dp_hpd_unplug_handle
[  146.938114] msm-dp-display ae98000.displayport-controller: 
dp_bridge_hpd_notify - link_ready = 1, status = 1
[  146.940245] [CONNECTOR:35:DP-2] status updated from disconnected to connected
[  146.955193] msm-dp-display ae98000.displayport-controller: 
dp_bridge_hpd_notify - link_ready = 0, status = 2

And it is the spurious connect event while the link is being tore down
that triggers the hotplug processing that leads to the reset.

Similarly, I've seen spurious disconnect events while the plug in being
inserted.

> A short response on why this change was made is that commit can be 
> issued by userspace or the fbdev client. So userspace involvement only 
> makes commit happen from a different path. It would be incorrect to 
> assume the issues from the earlier bug and the current one are different 
> only because there was userspace involvement in that one and not this.
>
> Because in the end, it manifests itself in the same way that 
> atomic_enable() did not go through after an atomic_disable() and the 
> next atomic_disable() crashes.

Right, but your proposed fix would not actually fix anything and judging
from the sparse commit message and diff itself it is clearly only meant
to mitigate the case where user space is involved, which is *not* the
case here.

Johan


Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread

2024-03-12 Thread Johan Hovold
On Tue, Mar 12, 2024 at 05:41:23PM +0100, Johan Hovold wrote:
> On Tue, Mar 12, 2024 at 11:09:11AM +0100, Johan Hovold wrote:
> > On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote:
> 
> > > @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device 
> > > *dev)
> > >  {
> > >   struct dp_display_private *dp = dev_get_dp_display_private(dev);
> > >  
> > > + dp->dp_display.link_ready = false;
> > 
> > As I also pointed out in the other thread, setting link_ready to false
> > here means that any spurious connect event (during physical disconnect)
> > will always be processed, something which can currently lead to a leaked
> > runtime pm reference.
> > 
> > Wasting some power is of course preferred over crashing the machine, but
> > please take it into consideration anyway.
> > 
> > Especially if your intention with this patch was to address the resets
> > we saw with sc8280xp which are gone since the HPD notify revert (which
> > fixed the hotplug detect issue that left the bridge in a
> > half-initialised state).
> 
> Heh. This is getting ridiculous. I just tried running with this patch
> and it again breaks hotplug detect in a VT console and in X (where I
> could enable a reconnected external display by running xrandr twice
> before).
> 
> So, please, do not apply this one.

To make things worse, I indeed also hit the reset when disconnecting
after such a failed hotplug.

Johan


Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread

2024-03-12 Thread Johan Hovold
On Tue, Mar 12, 2024 at 11:09:11AM +0100, Johan Hovold wrote:
> On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote:

> > @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device 
> > *dev)
> >  {
> > struct dp_display_private *dp = dev_get_dp_display_private(dev);
> >  
> > +   dp->dp_display.link_ready = false;
> 
> As I also pointed out in the other thread, setting link_ready to false
> here means that any spurious connect event (during physical disconnect)
> will always be processed, something which can currently lead to a leaked
> runtime pm reference.
> 
> Wasting some power is of course preferred over crashing the machine, but
> please take it into consideration anyway.
> 
> Especially if your intention with this patch was to address the resets
> we saw with sc8280xp which are gone since the HPD notify revert (which
> fixed the hotplug detect issue that left the bridge in a
> half-initialised state).

Heh. This is getting ridiculous. I just tried running with this patch
and it again breaks hotplug detect in a VT console and in X (where I
could enable a reconnected external display by running xrandr twice
before).

So, please, do not apply this one.

Johan


Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread

2024-03-12 Thread Johan Hovold
On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote:
> There are cases where the userspace might still send another
> frame after the HPD disconnect causing a modeset cycle after
> a disconnect. This messes the internal state machine of MSM DP driver
> and can lead to a crash as there can be an imbalance between
> bridge_disable() and bridge_enable().

Can you be more specific here? What steps would lead to this issue and
how exactly does is mess with the state machine? Is there an easy way
to reproduce it (e.g. by instrumenting the code with some sleep)?

The hotplug code is really convoluted and having a clear description of
the problem is needed to evaluate the patch (including when revisiting
it some time from now when I've forgotten about how this state machine
works).

As you know, we ran into a related issue on sc8280xp (X13s) since
6.8-rc1, but that did not involve any user space interaction at all.

For reference, there are some more details in this thread:

https://lore.kernel.org/all/ze8ke_m2xhypy...@hovoldconsulting.com/
 
> This was also previously reported on [1] for which [2] was posted
> and helped resolve the issue by rejecting commits if the DP is not
> in connected state.
> 
> The change resolved the bug but there can also be another race condition.
> If hpd_event_thread does not pick up the EV_USER_NOTIFICATION and process it
> link_ready will also not be set to false allowing the frame to sneak in.

How could the event thread fail to pick up the notification event? Or do
you mean there's a race window before it has been processed?

> Lets move setting link_ready outside of hpd_event_thread() processing to
> eliminate a window of race condition.

As we discussed in thread above, this patch does not eliminate the race,
even if it may reduce the race window.
 
> [1] : https://gitlab.freedesktop.org/drm/msm/-/issues/17
> [2] : 
> https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khs...@quicinc.com/
> 
> Fixes: 8a3b4c17f863 ("drm/msm/dp: employ bridge mechanism for display enable 
> and disable")
> Signed-off-by: Abhinav Kumar 

> @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device 
> *dev)
>  {
>   struct dp_display_private *dp = dev_get_dp_display_private(dev);
>  
> + dp->dp_display.link_ready = false;

As I also pointed out in the other thread, setting link_ready to false
here means that any spurious connect event (during physical disconnect)
will always be processed, something which can currently lead to a leaked
runtime pm reference.

Wasting some power is of course preferred over crashing the machine, but
please take it into consideration anyway.

Especially if your intention with this patch was to address the resets
we saw with sc8280xp which are gone since the HPD notify revert (which
fixed the hotplug detect issue that left the bridge in a
half-initialised state).

> +
>   dp_add_event(dp, EV_USER_NOTIFICATION, false, 0);
>  
>   return 0;

Johan


Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

2024-03-12 Thread Johan Hovold
On Mon, Mar 11, 2024 at 09:51:29AM -0700, Abhinav Kumar wrote:
> On 3/11/2024 6:43 AM, Johan Hovold wrote:
> > On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
> >> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:

> >>> I have posted my analysis with the patch here as a RFC y'day:
> >>>
> >>> https://patchwork.freedesktop.org/patch/581758/

> > I was able to reproduce the reset with some of my debug printks in place
> > after reapplying the reverted hpd notify change so I have an explanation
> > for (one of) the ways we can up in this state now.
> > 
> > This does not match description of the problem in the fix linked to
> > above and I don't think that patch solves the underlying issue even if
> > it may make the race window somewhat smaller. More details below.

> Its the same condition you described below that enable does not go 
> through and we bail out as we are in ST_DISCONNECTED.

It's closely related but clearly not the same as user space is not
involved in the reset I see.
 
> > Turns out there are paths like that and we hit those more often before
> > reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify().
> > 
> > Specifically, when a previous connect attempt did not enable the bridge
> > fully so that it is still in the ST_MAINLINK_READY when we receive a
> > disconnect event, dp_hpd_unplug_handle() will turn of the link clock.
> > 
> > [  204.527625] msm-dp-display ae98000.displayport-controller: 
> > dp_bridge_hpd_notify - link_ready = 1, status = 2
> > [  204.531553] msm-dp-display ae98000.displayport-controller: 
> > dp_hpd_unplug_handle
> > [  204.533261] msm-dp-display ae98000.displayport-controller: 
> > dp_ctrl_off_link
> > 
> > A racing connect event, such as the one I described earlier, can then
> > try to enable the bridge again but dp_bridge_atomic_enable() just bails
> > out early (and leaks a rpm reference) because we're now in
> > ST_DISCONNECTED:
> > 
> > [  204.535773] msm-dp-display ae98000.displayport-controller: 
> > dp_bridge_hpd_notify - link_ready = 1, status = 1
> > [  204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to 
> > connected
> > [  204.536905] msm-dp-display ae98000.displayport-controller: 
> > dp_display_notify_disconnect - would clear link ready (1), state = 0
> > [  204.537821] msm-dp-display ae98000.displayport-controller: 
> > dp_bridge_atomic_check - link_ready = 1
> > [  204.538063] msm-dp-display ae98000.displayport-controller: 
> > dp_display_send_hpd_notification - hpd = 0, link_ready = 1
> > [  204.542778] msm-dp-display ae98000.displayport-controller: 
> > dp_bridge_atomic_enable
> > [  204.586547] msm-dp-display ae98000.displayport-controller: 
> > dp_bridge_atomic_enable - state = 0 (rpm leak?)
> > 
> > Clearing link_ready already in dp_display_notify_disconnect() would make
> > the race window slightly smaller, but it would essentially just paper
> > over the bug as the events are still not serialised. Notably, there is
> > no user space interaction involved here and it's the spurious connect
> > event that triggers the bridge enable.

> Yes, it only narrows down the race condition window. The issue can still 
> happen if the commit / modeset was issued before we marked link_ready as 
> false.
> 
> And yes, I was only targetting a short term fix till we rework the HPD. 
> That will happen only incrementally as its a delicate piece of code.

Ok, thanks for confirming. Please also make that clear in the commit
message of any patch.

I am however not sure that your patch (RFC) is needed at this point as
the HPD revert fixes the 6.8-rc1 regression, and moving the clearing of
link_ready can actually make things worse as it makes any spurious
hotplug event always be processed (not just if they happen after
dp_display_send_hpd_notification()).

I'll reply to you patch as well.
 
> > So, while it may still be theoretically possible to hit the resets after
> > the revert, the HPD notify revert effectively "fixed" the regression in
> > 6.8-rc1 by removing the preconditions that now made us hit it (i.e. the
> > half-initialised bridge).

> Not entirely. In the bug which was reported before the patch 
> e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify() got landed, its a 
> classic example of how this issue can happen with userspace involvement 
> and not just fbdev client which was your case.

Sure, but you already added some kind of band-aid for that issue, right?


https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khs...@quicinc.com/

How likely is it that you'd stil

Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

2024-03-11 Thread Johan Hovold
On Mon, Mar 11, 2024 at 02:43:24PM +0100, Johan Hovold wrote:

> So, while it may still be theoretically possible to hit the resets after
> the revert, the HPD notify revert effectively "fixed" the regression in
> 6.8-rc1 by removing the preconditions that now made us hit it (i.e. the
> half-initialised bridge).
> 
> It seems the hotplug state machine needs to be reworked completely, but
> at least we're roughly back where we were with 6.7 (including that the
> bus clocks will never be turned of because of the rpm leaks on
> disconnect).

#regzbot introduced: e467e0bde881
#regzbot fix: 664bad6af3cb


Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

2024-03-11 Thread Johan Hovold
On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
> > On 3/8/2024 4:43 AM, Johan Hovold wrote:
> 
> > For this last remaining reset with the stacktrace you have mentioned 
> > below, I do not think this was introduced due to PM runtime series. We 
> > have had this report earlier with the same stacktrace as well in earlier 
> > kernels which didnt have PM runtime support:
> 
> > But, it seems like there is another race condition in this code 
> > resulting in this issue again.
> > 
> > I have posted my analysis with the patch here as a RFC y'day:
> > 
> > https://patchwork.freedesktop.org/patch/581758/
> > 
> > I missed CCing you and Bjorn on the RFC but when I post it as a patch 
> > either today/tomm, will CC you both.
> 
> Ok, thanks. I'll take a closer look at that next week. It's not clear to
> me what that race looks like after reading the commit message. It would
> be good to have some more details in there (e.g. the sequence of events
> that breaks the state machine) and some way to reproduce the issue
> reliably.

I was able to reproduce the reset with some of my debug printks in place
after reapplying the reverted hpd notify change so I have an explanation
for (one of) the ways we can up in this state now.

This does not match description of the problem in the fix linked to
above and I don't think that patch solves the underlying issue even if
it may make the race window somewhat smaller. More details below.
 
> > > We now also have Bjorn's call trace from a different Qualcomm platform
> > > triggering an unclocked access on disconnect, something which would
> > > trigger a reset by the hypervisor on sc8280xp platforms like the X13s:
> 
> > > [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 
> > > 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> > > [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics 
> > > RB3gen2 (DT)
> > > [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
> > > [ 2030.379642] Call trace:
> 
> > > [ 2030.379722]  wait_for_completion_timeout+0x14/0x20
> > > [ 2030.379727]  dp_ctrl_push_idle+0x34/0x8c [msm]
> > > [ 2030.379844]  dp_bridge_atomic_disable+0x18/0x24 [msm]
> > > [ 2030.379959]  drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
> > > [ 2030.380150]  drm_atomic_helper_commit_modeset_disables+0x174/0x57c 
> > > [drm_kms_helper]
> > > [ 2030.380200]  msm_atomic_commit_tail+0x1b4/0x474 [msm]
> > > [ 2030.380316]  commit_tail+0xa4/0x158 [drm_kms_helper]
> > > [ 2030.380369]  drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
> > > [ 2030.380418]  drm_atomic_commit+0xa8/0xd4 [drm]
> > > [ 2030.380529]  drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
> > > [ 2030.380641]  drm_client_modeset_commit_locked+0x50/0x168 [drm]
> > > [ 2030.380753]  drm_client_modeset_commit+0x2c/0x54 [drm]
> > > [ 2030.380865]  __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 
> > > [drm_kms_helper]
> > > [ 2030.380915]  drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
> > > [ 2030.380965]  msm_fbdev_client_hotplug+0x28/0xc8 [msm]
> > > [ 2030.381081]  drm_client_dev_hotplug+0x94/0x118 [drm]
> > > [ 2030.381192]  output_poll_execute+0x214/0x26c [drm_kms_helper]
> > > [ 2030.381241]  process_scheduled_works+0x19c/0x2cc
> > > [ 2030.381249]  worker_thread+0x290/0x3cc
> > > [ 2030.381255]  kthread+0xfc/0x184
> > > [ 2030.381260]  ret_from_fork+0x10/0x20
> > > 
> > > The above could happen if the convoluted hotplug state machine breaks
> > > down so that the device is runtime suspended before
> > > dp_bridge_atomic_disable() is called.
> 
> > Yes, state machine got broken and I have explained how in the commit 
> > text of the RFC. But its not necessarily due to PM runtime but a 
> > sequence of events happening from userspace exposing this breakage.
> 
> After looking at this some more today, I agree with you. The
> observations I've reported are consistent with what would happen if the
> link clock is disabled when dp_bridge_atomic_disable() is called.
> 
> That clock is disabled in dp_bridge_atomic_post_disable() before runtime
> suspending, but perhaps there are some further paths that can end up
> disabling it.

Turns out there are paths like that and we hit those more often before
reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify().

Specifically, when a previous 

Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

2024-03-09 Thread Johan Hovold
On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
> On 3/8/2024 4:43 AM, Johan Hovold wrote:

> For this last remaining reset with the stacktrace you have mentioned 
> below, I do not think this was introduced due to PM runtime series. We 
> have had this report earlier with the same stacktrace as well in earlier 
> kernels which didnt have PM runtime support:

> But, it seems like there is another race condition in this code 
> resulting in this issue again.
> 
> I have posted my analysis with the patch here as a RFC y'day:
> 
> https://patchwork.freedesktop.org/patch/581758/
> 
> I missed CCing you and Bjorn on the RFC but when I post it as a patch 
> either today/tomm, will CC you both.

Ok, thanks. I'll take a closer look at that next week. It's not clear to
me what that race looks like after reading the commit message. It would
be good to have some more details in there (e.g. the sequence of events
that breaks the state machine) and some way to reproduce the issue
reliably.

> > We now also have Bjorn's call trace from a different Qualcomm platform
> > triggering an unclocked access on disconnect, something which would
> > trigger a reset by the hypervisor on sc8280xp platforms like the X13s:

> > [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
> > [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 
> > 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> > [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 
> > (DT)
> > [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
> > [ 2030.379642] Call trace:

> > [ 2030.379722]  wait_for_completion_timeout+0x14/0x20
> > [ 2030.379727]  dp_ctrl_push_idle+0x34/0x8c [msm]
> > [ 2030.379844]  dp_bridge_atomic_disable+0x18/0x24 [msm]
> > [ 2030.379959]  drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
> > [ 2030.380150]  drm_atomic_helper_commit_modeset_disables+0x174/0x57c 
> > [drm_kms_helper]
> > [ 2030.380200]  msm_atomic_commit_tail+0x1b4/0x474 [msm]
> > [ 2030.380316]  commit_tail+0xa4/0x158 [drm_kms_helper]
> > [ 2030.380369]  drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
> > [ 2030.380418]  drm_atomic_commit+0xa8/0xd4 [drm]
> > [ 2030.380529]  drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
> > [ 2030.380641]  drm_client_modeset_commit_locked+0x50/0x168 [drm]
> > [ 2030.380753]  drm_client_modeset_commit+0x2c/0x54 [drm]
> > [ 2030.380865]  __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 
> > [drm_kms_helper]
> > [ 2030.380915]  drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
> > [ 2030.380965]  msm_fbdev_client_hotplug+0x28/0xc8 [msm]
> > [ 2030.381081]  drm_client_dev_hotplug+0x94/0x118 [drm]
> > [ 2030.381192]  output_poll_execute+0x214/0x26c [drm_kms_helper]
> > [ 2030.381241]  process_scheduled_works+0x19c/0x2cc
> > [ 2030.381249]  worker_thread+0x290/0x3cc
> > [ 2030.381255]  kthread+0xfc/0x184
> > [ 2030.381260]  ret_from_fork+0x10/0x20
> > 
> > The above could happen if the convoluted hotplug state machine breaks
> > down so that the device is runtime suspended before
> > dp_bridge_atomic_disable() is called.

> Yes, state machine got broken and I have explained how in the commit 
> text of the RFC. But its not necessarily due to PM runtime but a 
> sequence of events happening from userspace exposing this breakage.

After looking at this some more today, I agree with you. The
observations I've reported are consistent with what would happen if the
link clock is disabled when dp_bridge_atomic_disable() is called.

That clock is disabled in dp_bridge_atomic_post_disable() before runtime
suspending, but perhaps there are some further paths that can end up
disabling it.

> > For some reason, possibly due to unrelated changes in timing, possibly
> > after the hotplug revert, I am no longer able to reproduce the reset
> > with 6.8-rc7 on the X13s.

> I do not know how the hotplug revert fixed this stack because I think 
> this can still happen.

Thanks for confirming.
 
> > I am however able to get the hotplug state machine to leak
> > runtime PM reference counts which prevents it from ever being suspended
> > (e.g. by disconnecting slowly so that we get multiple connect and
> > disconnect events). This can manifest itself as a hotplug event which is
> > processed after disconnecting the display:
> > 
> > [drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

> hmm ... this is another new report. I was not aware of this till this 
> instance. We would like to debug this issue too as we have also not seen 
> this one before.
> 
> But, I am suspicious about how or why a refco

Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

2024-03-08 Thread Johan Hovold
Hi Abhinav, Rob, Dmitry and Kuogee,

On Tue, Feb 27, 2024 at 02:33:48PM +0100, Johan Hovold wrote:

> Since 6.8-rc1 I have seen (and received reports) of hard resets of the
> Lenovo ThinkPad X13s after connecting and disconnecting an external
> display.
> 
> I have triggered this on a simple disconnect while in a VT console, but
> also when stopping Xorg after having repeatedly connected and
> disconnected an external display and tried to enable it using xrandr.
> 
> In the former case, the last (custom debug) messages printed over an SSH
> session were once:
> 
> [  948.416358] usb 5-1: USB disconnect, device number 3
> [  948.443496] msm_dpu ae01000.display-controller: 
> msm_fbdev_client_hotplug
> [  948.443723] msm-dp-display ae98000.displayport-controller: 
> dp_power_clk_enable - type = 1, enable = 0
> [  948.443872] msm-dp-display ae98000.displayport-controller: 
> dp_ctrl_phy_exit
> [  948.445117] msm-dp-display ae98000.displayport-controller: 
> dp_ctrl_phy_exit - done
> 
> and then the hypervisor resets the machine.

Has there been any progress on tracking down the reset on disconnect
issue? I was expecting you to share your findings here so that we can
determine if the rest of the runtime PM series needs to be reverted or
not before 6.8 is released (possibly on Sunday).

I really did not want to spend more on this driver than I already have
this cycle, but the lack of (visible) progress has again forced me to do
so.

It's quite likely that the resets are indeed a regression caused by the
runtime PM series as the bus clocks were not disabled on disconnect
before that one was merged in 6.8-rc1.

In a VT console, the device is now runtime suspended immediately on
disconnect, while in X, it currently remains active until X is killed,
which is consistent with what I reported above.

We now also have Bjorn's call trace from a different Qualcomm platform
triggering an unclocked access on disconnect, something which would
trigger a reset by the hypervisor on sc8280xp platforms like the X13s:

[ 2030.379417] SError Interrupt on CPU0, code 0xbe00 -- SError
[ 2030.379425] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 
6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379430] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379435] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379495] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2030.379501] pc : el1_interrupt+0x28/0xc0
[ 2030.379514] lr : el1h_64_irq_handler+0x18/0x24
[ 2030.379520] sp : 80008129b700
[ 2030.379523] x29: 80008129b700 x28: 7a0a031aa200 x27: 7a0af768c3c0
[ 2030.379530] x26:  x25: 7a0a024bb568 x24: 7a0a031aa200
[ 2030.379537] x23: 6045 x22: d2a4a10c871c x21: 80008129b8a0
[ 2030.379544] x20: d2a4a2f0 x19: 80008129b750 x18: 
[ 2030.379549] x17:  x16: d2a4a10c21d4 x15: 
[ 2030.379555] x14:  x13: 0001acf6 x12: 
[ 2030.379560] x11: 0001 x10: 7a0af623f680 x9 : 00010001
[ 2030.379565] x8 : 00c0 x7 :  x6 : 003f
[ 2030.379570] x5 : 7a0a003c6a70 x4 : 001f x3 : d2a48d6722dc
[ 2030.379576] x2 : 0002 x1 : d2a4a2f0 x0 : 80008129b750
[ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 
6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379642] Call trace:
[ 2030.379644]  dump_backtrace+0xec/0x108
[ 2030.379654]  show_stack+0x18/0x24
[ 2030.379659]  dump_stack_lvl+0x40/0x84
[ 2030.379664]  dump_stack+0x18/0x24
[ 2030.379668]  panic+0x130/0x34c
[ 2030.379673]  nmi_panic+0x44/0x90
[ 2030.379679]  arm64_serror_panic+0x68/0x74
[ 2030.379683]  do_serror+0xc4/0xcc
[ 2030.379686]  el1h_64_error_handler+0x34/0x4c
[ 2030.379692]  el1h_64_error+0x64/0x68
[ 2030.379696]  el1_interrupt+0x28/0xc0
[ 2030.379700]  el1h_64_irq_handler+0x18/0x24
[ 2030.379706]  el1h_64_irq+0x64/0x68
[ 2030.379710]  _raw_spin_unlock_irq+0x20/0x48
[ 2030.379718]  wait_for_common+0xb4/0x16c
[ 2030.379722]  wait_for_completion_timeout+0x14/0x20
[ 2030.379727]  dp_ctrl_push_idle+0x34/0x8c [msm]
[ 2030.379844]  dp_bridge_atomic_disable+0x18/0x24 [msm]
[ 2030.379959]  drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
[ 2030.380150]  drm_atomic_helper_commit_modeset_disables+0x174/0x57c 
[drm_kms_helper]
[ 2030.380200]  msm_atomic_commit_tail+0x1b4/0x474 [msm]
[ 2030.380316]  commit_tail+0xa4/0x158 [drm_kms_helper]
[ 2030.380369]  drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
[ 2030.380418]  drm_atomic_commit+0xa8

[PATCH stable-6.7] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free

2024-03-08 Thread Johan Hovold
commit b979f2d50a099f3402418d7ff5f26c3952fb08bb upstream.

A recent DRM series purporting to simplify support for "transparent
bridges" and handling of probe deferrals ironically exposed a
use-after-free issue on pmic_glink_altmode probe deferral.

This has manifested itself as the display subsystem occasionally failing
to initialise and NULL-pointer dereferences during boot of machines like
the Lenovo ThinkPad X13s.

Specifically, the dp-hpd bridge is currently registered before all
resources have been acquired which means that it can also be
deregistered on probe deferrals.

In the meantime there is a race window where the new aux bridge driver
(or PHY driver previously) may have looked up the dp-hpd bridge and
stored a (non-reference-counted) pointer to the bridge which is about to
be deallocated.

When the display controller is later initialised, this triggers a
use-after-free when attaching the bridges:

dp -> aux -> dp-hpd (freed)

which may, for example, result in the freed bridge failing to attach:

[drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge 
/soc@0/phy@88eb000 to encoder TMDS-31: -16

or a NULL-pointer dereference:

Unable to handle kernel NULL pointer dereference at virtual address 

...
Call trace:
  drm_bridge_attach+0x70/0x1a8 [drm]
  drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
  drm_bridge_attach+0x80/0x1a8 [drm]
  dp_bridge_init+0xa8/0x15c [msm]
  msm_dp_modeset_init+0x28/0xc4 [msm]

The DRM bridge implementation is clearly fragile and implicitly built on
the assumption that bridges may never go away. In this case, the fix is
to move the bridge registration in the pmic_glink_altmode driver to
after all resources have been looked up.

Incidentally, with the new dp-hpd bridge implementation, which registers
child devices, this is also a requirement due to a long-standing issue
in driver core that can otherwise lead to a probe deferral loop (see
commit fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")).

[DB: slightly fixed commit message by adding the word 'commit']
Fixes: 080b4e24852b ("soc: qcom: pmic_glink: Introduce altmode support")
Fixes: 2bcca96abfbf ("soc: qcom: pmic-glink: switch to DRM_AUX_HPD_BRIDGE")
Cc:   # 6.3
Cc: Bjorn Andersson 
Cc: Dmitry Baryshkov 
Signed-off-by: Johan Hovold 
Reviewed-by: Bjorn Andersson 
Reviewed-by: Dmitry Baryshkov 
Signed-off-by: Dmitry Baryshkov 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20240217150228.5788-4-johan+lin...@kernel.org
[ johan: backport to 6.7 which does not have DRM aux bridge ]
Signed-off-by: Johan Hovold 
---
 drivers/soc/qcom/pmic_glink_altmode.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/soc/qcom/pmic_glink_altmode.c 
b/drivers/soc/qcom/pmic_glink_altmode.c
index ad922f0dca6b..a890fafdafb8 100644
--- a/drivers/soc/qcom/pmic_glink_altmode.c
+++ b/drivers/soc/qcom/pmic_glink_altmode.c
@@ -469,12 +469,6 @@ static int pmic_glink_altmode_probe(struct 
auxiliary_device *adev,
alt_port->bridge.ops = DRM_BRIDGE_OP_HPD;
alt_port->bridge.type = DRM_MODE_CONNECTOR_DisplayPort;
 
-   ret = devm_drm_bridge_add(dev, _port->bridge);
-   if (ret) {
-   fwnode_handle_put(fwnode);
-   return ret;
-   }
-
alt_port->dp_alt.svid = USB_TYPEC_DP_SID;
alt_port->dp_alt.mode = USB_TYPEC_DP_MODE;
alt_port->dp_alt.active = 1;
@@ -525,6 +519,16 @@ static int pmic_glink_altmode_probe(struct 
auxiliary_device *adev,
}
}
 
+   for (port = 0; port < ARRAY_SIZE(altmode->ports); port++) {
+   alt_port = >ports[port];
+   if (!alt_port->altmode)
+   continue;
+
+   ret = devm_drm_bridge_add(dev, _port->bridge);
+   if (ret)
+   return ret;
+   }
+
altmode->client = devm_pmic_glink_register_client(dev,
  altmode->owner_id,
  
pmic_glink_altmode_callback,
-- 
2.43.0



Re: [PATCH] Revert "drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes"

2024-02-28 Thread Johan Hovold
On Wed, Feb 28, 2024 at 10:10:10AM -0800, Abhinav Kumar wrote:
> On 2/28/2024 5:28 AM, Johan Hovold wrote:

> > This is a fix for a user-visible regression that was reported formally
> > two weeks ago and informally (e.g. to you) soon after rc1 came out, and
> > which now also has an identified cause and an analysis of the problem.
> > And we're at rc6 so there should be no reason to delay fixing this (e.g.
> > even if you want to run some more tests for a couple of days).
> 
> Yup, we landed it in msm-fixes now, so this will go as a late -fixes PR 
> for 6.8.

Perfect, thanks!

Johan


Re: [PATCH] Revert "drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes"

2024-02-28 Thread Johan Hovold
On Wed, Feb 28, 2024 at 01:08:04PM +0200, Dmitry Baryshkov wrote:
> On Wed, 28 Feb 2024 at 11:50, Johan Hovold  wrote:
> >
> > On Tue, Feb 27, 2024 at 02:11:56PM -0800, Abhinav Kumar wrote:
> > > On 2/27/2024 2:08 PM, Dmitry Baryshkov wrote:
> > > > This reverts commit e467e0bde881 ("drm/msm/dp: use
> > > > drm_bridge_hpd_notify() to report HPD status changes").
> > > >
> > > > The commit changed the way how the MSM DP driver communicates
> > > > HPD-related events to the userspace. The mentioned commit made some of
> > > > the HPD events being reported earlier. This way userspace starts poking
> > > > around. It interacts in a bad way with the dp_bridge_detect and the
> > > > driver's state machine, ending up either with the very long delays
> > > > during hotplug detection or even inability of the DP driver to report
> > > > the display as connected.
> > > >
> > > > A proper fix will involve redesigning of the HPD handling in the MSM DP
> > > > driver. It is underway, but it will be intrusive and can not be thought
> > > > about as a simple fix for the issue. Thus, revert the offending commit.
> > >
> > > Yes, for fixing this on 6.9 I am fine with this.
> >
> > Since this is a regression in 6.8-rc1, I hope you meant to say 6.8 here?
> 
> In the worst case it will land to 6.8.x via the stable tree process.

This is a fix for a user-visible regression that was reported formally
two weeks ago and informally (e.g. to you) soon after rc1 came out, and
which now also has an identified cause and an analysis of the problem.
And we're at rc6 so there should be no reason to delay fixing this (e.g.
even if you want to run some more tests for a couple of days).

Johan


Re: [PATCH] Revert "drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes"

2024-02-28 Thread Johan Hovold
On Tue, Feb 27, 2024 at 02:11:56PM -0800, Abhinav Kumar wrote:
> On 2/27/2024 2:08 PM, Dmitry Baryshkov wrote:
> > This reverts commit e467e0bde881 ("drm/msm/dp: use
> > drm_bridge_hpd_notify() to report HPD status changes").
> > 
> > The commit changed the way how the MSM DP driver communicates
> > HPD-related events to the userspace. The mentioned commit made some of
> > the HPD events being reported earlier. This way userspace starts poking
> > around. It interacts in a bad way with the dp_bridge_detect and the
> > driver's state machine, ending up either with the very long delays
> > during hotplug detection or even inability of the DP driver to report
> > the display as connected.
> > 
> > A proper fix will involve redesigning of the HPD handling in the MSM DP
> > driver. It is underway, but it will be intrusive and can not be thought
> > about as a simple fix for the issue. Thus, revert the offending commit.
> 
> Yes, for fixing this on 6.9 I am fine with this.

Since this is a regression in 6.8-rc1, I hope you meant to say 6.8 here?

> I hope there were not other changes which were built on top of this. So 
> it will be better if we retest internal HPD case as well with this.
> 
> We will do that in a day or two and give Tested-by.

Johan


Re: [PATCH] Revert "drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes"

2024-02-27 Thread Johan Hovold
On Wed, Feb 28, 2024 at 12:08:08AM +0200, Dmitry Baryshkov wrote:
> This reverts commit e467e0bde881 ("drm/msm/dp: use
> drm_bridge_hpd_notify() to report HPD status changes").
> 
> The commit changed the way how the MSM DP driver communicates
> HPD-related events to the userspace. The mentioned commit made some of
> the HPD events being reported earlier. This way userspace starts poking
> around. It interacts in a bad way with the dp_bridge_detect and the
> driver's state machine, ending up either with the very long delays
> during hotplug detection or even inability of the DP driver to report
> the display as connected.
> 
> A proper fix will involve redesigning of the HPD handling in the MSM DP
> driver. It is underway, but it will be intrusive and can not be thought
> about as a simple fix for the issue. Thus, revert the offending commit.
> 
> Fixes: e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify() to report HPD 
> status changes")
> Link: https://gitlab.freedesktop.org/drm/msm/-/issues/50
> Reported-by: Johan Hovold 
> Link: https://lore.kernel.org/r/zd3ypgmrprxv-...@hovoldconsulting.com/
> Signed-off-by: Dmitry Baryshkov 

Tested-by: Johan Hovold 


drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

2024-02-27 Thread Johan Hovold
Hi,

Since 6.8-rc1 I have seen (and received reports) of hard resets of the
Lenovo ThinkPad X13s after connecting and disconnecting an external
display.

I have triggered this on a simple disconnect while in a VT console, but
also when stopping Xorg after having repeatedly connected and
disconnected an external display and tried to enable it using xrandr.

In the former case, the last (custom debug) messages printed over an SSH
session were once:

[  948.416358] usb 5-1: USB disconnect, device number 3
[  948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
[  948.443723] msm-dp-display ae98000.displayport-controller: 
dp_power_clk_enable - type = 1, enable = 0
[  948.443872] msm-dp-display ae98000.displayport-controller: 
dp_ctrl_phy_exit
[  948.445117] msm-dp-display ae98000.displayport-controller: 
dp_ctrl_phy_exit - done

and then the hypervisor resets the machine.

Hotplug in Xorg seems to work worse than it did with 6.7, which also had
some issues. Connecting a display once seems to work fine, but trying to
re-enable a reconnected display using xrandr sometimes does not work at
all, while with 6.7 it usually worked on the second xrandr execution.

xrandr reports the reconnected display as disconnected:

Screen 0: minimum 320 x 200, current 1920 x 1200, maximum 5120 x 4096
eDP-1 connected primary 1920x1200+0+0 (normal left inverted right x axis y 
axis) 286mm x 178mm
   1920x1200 60.03*+
   1600x1200 60.00  
DP-1 disconnected (normal left inverted right x axis y axis)
DP-2 disconnected 1920x1200+0+0 (normal left inverted right x axis y axis) 
0mm x 0mm
  1920x1200 (0x40c) 154.000MHz +HSync -VSync
h: width  1920 start 1968 end 2000 total 2080 skew0 clock  
74.04KHz
v: height 1200 start 1203 end 1209 total 1235   clock  
59.95Hz

Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the
display as connected, but the display is still blank (unlike with 6.7).

A few times after having exercised hotplug this way, the machine hard
resets when Xorg is later stopped. Once I saw the following log messages
on an SSH session but they may not have been printed directly before
the hard reset:

[  214.555781] [drm:dpu_encoder_phys_vid_wait_for_commit_done:492] [dpu 
error]vblank timeout
[  214.555843] [drm:dpu_kms_wait_for_commit_done:483] [dpu error]wait for 
commit done returned -110

Note that this appears to be unrelated to the recently fixed Qualcomm
power domain driver bug which can trigger similar resets when
initialising the display subsystem on boot. Specifically, I have
triggered the hotplug resets described above also with the fix applied.
[1]

Reverting commit e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()
to report HPD status changes") which fixes the related VT console
regression does not seem to make any difference. [2]

Daniel Thompson reports that reverting the whole runtime PM series
appears to make the hard resets he has seen with DisplayPort hotplug go
away however:


https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khs...@quicinc.com/

So for now, let's assume that these regressions were also introduced (or
triggered) by commit 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime
framework into DP driver").

Johan


[1] 
https://lore.kernel.org/lkml/20240226-rpmhpd-enable-corner-fix-v1-1-68c004cec...@quicinc.com/
[2] https://lore.kernel.org/lkml/zd3ypgmrprxv-...@hovoldconsulting.com/


#regzbot introduced: 5814b8bf086a


drm/msm: VT console DisplayPort regression in 6.8-rc1

2024-02-27 Thread Johan Hovold
Hi,

Since 6.8-rc1 the VT console is no longer mirrored on an external
display on coldplug or hotplug on the Lenovo ThinkPad X13s.

The hotplug notification appears to be generated immediately but it is
no longer forwarded or processed correctly:

[   22.578434] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
[   22.953161] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
[   23.095122] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
[   24.185683] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
[   24.186197] msm-dp-display ae98000.displayport-controller: 
dp_pm_runtime_resume
[   24.188098] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_init
[   24.191805] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
[   24.275904] [drm-dp] dp_ctrl_on_link: dp_ctrl_on_link - drm_dev = 

[   24.276474] [drm-dp] dp_ctrl_enable_mainlink_clocks: 
dp_ctrl_enable_mainlink_clocks - drm_dev = 

as the external display remains off/blank.

I've verified that reverting commit e467e0bde881 ("drm/msm/dp: use
drm_bridge_hpd_notify() to report HPD status changes") fixes "both"
issues.

I've previously reported this here:

https://gitlab.freedesktop.org/drm/msm/-/issues/50

Note that a couple of times the display was eventually mirrored after a
very long timeout, but this doesn't seem to always be the case.

Johan


#regzbot introduced: e467e0bde881


Re: [PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free

2024-02-23 Thread Johan Hovold
On Fri, Feb 23, 2024 at 03:38:13PM +0100, Neil Armstrong wrote:
> On 23/02/2024 15:21, Johan Hovold wrote:

> > But it is *not* standalone as I tried to explain above.
> > 
> > So you have to drop it again as the later patches depend on it and
> > cannot be merged (through a different tree) without it.
> 
> drm-misc branches cannot be rebased, it must be reverted, but it can still be 
> applied
> on drm-misc-next and I'll send a revert patch for drm-misc-fixes if needed, 
> not a big deal.
> 
> > I thought you had all the acks you needed to take this through drm-misc,
> > but we can wait a bit more if necessary (and there's no rush to get the
> > first one in).
> 
> If you want it to be in v6.9, it's too late since the last drm-misc-next PR 
> has been sent
> yesterday 
> (https://cgit.freedesktop.org/drm/drm-misc/tag/?h=drm-misc-next-2024-02-22)
> 
> Please ping Thomas or Maxime, perhaps it's not too late since the 
> drm-misc-next tree
> really closes on sunday.

I don't want this in 6.9, this is needed for *6.8* as this fixes a DRM
regression in 6.8-rc1 that breaks the display on machines like the X13s.

If you guys can't sort this out in time, then perhaps Bjorn can take
this through the Qualcomm tree instead (with DRM acks).

But again, this is fixing a severe *regression* in 6.8-rc1. It can not
wait for 6.9.

Johan


Re: [PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free

2024-02-23 Thread Johan Hovold
On Fri, Feb 23, 2024 at 04:18:08PM +0200, Dmitry Baryshkov wrote:
> On Fri, 23 Feb 2024 at 15:52, Neil Armstrong  
> wrote:
> > On 23/02/2024 13:51, Johan Hovold wrote:
> > > On Fri, Feb 23, 2024 at 12:03:10PM +0100, Neil Armstrong wrote:
> > >> On 23/02/2024 12:02, Neil Armstrong wrote:

> > >>> Thanks, Applied to https://anongit.freedesktop.org/git/drm/drm-misc.git 
> > >>> (drm-misc-fixes)
> > >>>
> > >>> [1/6] drm/bridge: aux-hpd: fix OF node leaks
> > >>> 
> > >>> https://cgit.freedesktop.org/drm/drm-misc/commit/?id=9ee485bdda68d6d3f5728cbe3150eb9013d7d22b
> > >>> [2/6] drm/bridge: aux-hpd: separate allocation and registration
> > >>> (no commit info)
> > >>> [3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
> > >>> (no commit info)
> > >>> [4/6] soc: qcom: pmic_glink: Fix boot when QRTR=m
> > >>> (no commit info)
> > >>> [5/6] phy: qcom-qmp-combo: fix drm bridge registration
> > >>> (no commit info)
> > >>> [6/6] phy: qcom-qmp-combo: fix type-c switch registration
> > >>> (no commit info)
> > >>>
> > >>
> > >> To clarify, I only applied patch 1 to drm-misc-fixes
> > >
> > > Ok, but can you please not do that? :)
> > >
> > > These patches should go in through the same tree to avoid conflicts.
> > >
> > > I discussed this with Bjorn and Dmitry the other day and the conclusion
> > > was that it was easiest to take all of these through DRM.
> >
> > I only applied patch 1, which is a standalone fix and goes into a separate 
> > tree,
> > for the next patches it would be indeed simpler for them to go via drm-misc 
> > when
> > they are properly acked.
> 
> I think PHY patches can go through a usual route (phy/next or
> phy/fixes).

They can, but I've explicitly asked Vinod to ack them so that they can
go in with the rest of the series through DRM.

They also fix a regression that came in through DRM in 6.8-rc1 (the
bridge rework which started registering child devices) so it makes sense
to also route the fix the same way. And to do it for this cycle.

> For patches 3 and 4 I'd need an ack from Bjorn to merge
> them through drm-misc-next or drm-misc-fixes.

You have Bjorn's ack. He's reviewed all the patches for this purpose and
we discussed this in person two days ago.

And, again, this has to go in for *this* cycle. You broke the display on
the X13s and other machines so this cannot wait for 6.9.

Johan


Re: [PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free

2024-02-23 Thread Johan Hovold
On Fri, Feb 23, 2024 at 02:52:28PM +0100, Neil Armstrong wrote:
> On 23/02/2024 13:51, Johan Hovold wrote:
> > On Fri, Feb 23, 2024 at 12:03:10PM +0100, Neil Armstrong wrote:
> >> On 23/02/2024 12:02, Neil Armstrong wrote:

> >>> Thanks, Applied to https://anongit.freedesktop.org/git/drm/drm-misc.git 
> >>> (drm-misc-fixes)
> >>>
> >>> [1/6] drm/bridge: aux-hpd: fix OF node leaks
> >>> 
> >>> https://cgit.freedesktop.org/drm/drm-misc/commit/?id=9ee485bdda68d6d3f5728cbe3150eb9013d7d22b
> >>> [2/6] drm/bridge: aux-hpd: separate allocation and registration
> >>> (no commit info)
> >>> [3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
> >>> (no commit info)
> >>> [4/6] soc: qcom: pmic_glink: Fix boot when QRTR=m
> >>> (no commit info)
> >>> [5/6] phy: qcom-qmp-combo: fix drm bridge registration
> >>> (no commit info)
> >>> [6/6] phy: qcom-qmp-combo: fix type-c switch registration
> >>> (no commit info)
> >>>
> >>
> >> To clarify, I only applied patch 1 to drm-misc-fixes
> > 
> > Ok, but can you please not do that? :)
> > 
> > These patches should go in through the same tree to avoid conflicts.
> > 
> > I discussed this with Bjorn and Dmitry the other day and the conclusion
> > was that it was easiest to take all of these through DRM.
> 
> I only applied patch 1, which is a standalone fix and goes into a separate 
> tree,
> for the next patches it would be indeed simpler for them to go via drm-misc 
> when
> they are properly acked.

But it is *not* standalone as I tried to explain above.

So you have to drop it again as the later patches depend on it and
cannot be merged (through a different tree) without it.

I thought you had all the acks you needed to take this through drm-misc,
but we can wait a bit more if necessary (and there's no rush to get the
first one in).

Johan


Re: [PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free

2024-02-23 Thread Johan Hovold
On Fri, Feb 23, 2024 at 12:03:10PM +0100, Neil Armstrong wrote:
> On 23/02/2024 12:02, Neil Armstrong wrote:
> > Hi,
> > 
> > On Sat, 17 Feb 2024 16:02:22 +0100, Johan Hovold wrote:
> >> Starting with 6.8-rc1 the internal display sometimes fails to come up on
> >> machines like the Lenovo ThinkPad X13s and the logs indicate that this
> >> is due to a regression in the DRM subsystem [1].
> >>
> >> This series fixes a race in the pmic_glink_altmode driver which was
> >> exposed / triggered by the transparent DRM bridges rework that went into
> >> 6.8-rc1 and that manifested itself as a bridge failing to attach and
> >> sometimes triggering a NULL-pointer dereference.
> >>
> >> [...]
> > 
> > Thanks, Applied to https://anongit.freedesktop.org/git/drm/drm-misc.git 
> > (drm-misc-fixes)
> > 
> > [1/6] drm/bridge: aux-hpd: fix OF node leaks
> >
> > https://cgit.freedesktop.org/drm/drm-misc/commit/?id=9ee485bdda68d6d3f5728cbe3150eb9013d7d22b
> > [2/6] drm/bridge: aux-hpd: separate allocation and registration
> >(no commit info)
> > [3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
> >(no commit info)
> > [4/6] soc: qcom: pmic_glink: Fix boot when QRTR=m
> >(no commit info)
> > [5/6] phy: qcom-qmp-combo: fix drm bridge registration
> >(no commit info)
> > [6/6] phy: qcom-qmp-combo: fix type-c switch registration
> >(no commit info)
> > 
> 
> To clarify, I only applied patch 1 to drm-misc-fixes

Ok, but can you please not do that? :)

These patches should go in through the same tree to avoid conflicts.

I discussed this with Bjorn and Dmitry the other day and the conclusion
was that it was easiest to take all of these through DRM.

With Vinod acking the PHY patches, I believe you have what you need to
merge the whole series now?

Johan


Re: [PATCH 2/6] drm/bridge: aux-hpd: separate allocation and registration

2024-02-23 Thread Johan Hovold
On Thu, Feb 22, 2024 at 10:57:07PM +0200, Dmitry Baryshkov wrote:
> On Sat, 17 Feb 2024 at 17:03, Johan Hovold  wrote:
> >
> > Combining allocation and registration is an anti-pattern that should be
> > avoided. Add two new functions for allocating and registering an dp-hpd
> > bridge with a proper 'devm' prefix so that it is clear that these are
> > device managed interfaces.
> >
> > devm_drm_dp_hpd_bridge_alloc()
> > devm_drm_dp_hpd_bridge_add()
> >
> > The new interface will be used to fix a use-after-free bug in the
> > Qualcomm PMIC GLINK driver and may prevent similar issues from being
> > introduced elsewhere.
> >
> > The existing drm_dp_hpd_bridge_register() is reimplemented using the
> > above and left in place for now.
> >
> > Signed-off-by: Johan Hovold 
> 
> Reviewed-by: Dmitry Baryshkov 

Thanks for reviewing.

> Minor nit below.

> > diff --git a/include/drm/bridge/aux-bridge.h 
> > b/include/drm/bridge/aux-bridge.h
> > index c4c423e97f06..4453906105ca 100644
> > --- a/include/drm/bridge/aux-bridge.h
> > +++ b/include/drm/bridge/aux-bridge.h
> > @@ -9,6 +9,8 @@
> >
> >  #include 
> >
> > +struct auxiliary_device;
> > +
> >  #if IS_ENABLED(CONFIG_DRM_AUX_BRIDGE)
> >  int drm_aux_bridge_register(struct device *parent);
> >  #else
> > @@ -19,10 +21,23 @@ static inline int drm_aux_bridge_register(struct device 
> > *parent)
> >  #endif
> >
> >  #if IS_ENABLED(CONFIG_DRM_AUX_HPD_BRIDGE)
> > +struct auxiliary_device *devm_drm_dp_hpd_bridge_alloc(struct device 
> > *parent, struct device_node *np);
> > +int devm_drm_dp_hpd_bridge_add(struct device *dev, struct auxiliary_device 
> > *adev);
> 
> I had a pretty close idea during prototyping, but I ended up doing it
> as a single function for the following reasons:
> 
> First, this exports the implementation detail that internally the code
> uses an aux device.

That's not an issue. The opposite, with interfaces trying to do too much
and hide details from the developers so that they can no longer reason
about what is going on, is a real problem though.

> Also, by exporting the aux device the code becomes less type-safe. By
> mistake one can call devm_drm_dp_hpd_bridge_add() on any aux device,
> which is not necessarily the HPD bridge.

No. First, that is currently not even an issue either as the
registration interface is safe to use with any aux device.

Second, if you cared about about type-safety you wouldn't have used a
struct device pointer for drm_aux_hpd_bridge_notify() which you back
cast to an aux device.

> I'd prefer to see an opaque device-specific structure instead. WDYT?

That should have been there from the start. But I'm not interested in
cleaning up this mess beyond what is minimally required to fix the
regressions it caused.

This can be reworked for 6.9 or later.

> >  struct device *drm_dp_hpd_bridge_register(struct device *parent,
> >   struct device_node *np);
> >  void drm_aux_hpd_bridge_notify(struct device *dev, enum 
> > drm_connector_status status);

Johan


Re: [PATCH 2/6] drm/bridge: aux-hpd: separate allocation and registration

2024-02-22 Thread Johan Hovold
On Wed, Feb 21, 2024 at 08:06:41PM -0600, Bjorn Andersson wrote:
> On Sat, Feb 17, 2024 at 04:02:24PM +0100, Johan Hovold wrote:
> > diff --git a/drivers/gpu/drm/bridge/aux-hpd-bridge.c 
> > b/drivers/gpu/drm/bridge/aux-hpd-bridge.c
> [..]
> > +/**
> > + * devm_drm_dp_hpd_bridge_add - register a HDP DisplayPort bridge
> 
> kernel-doc wants () after function names.

I don't think that's required for the symbol name here even if some
subsystems (drivers) use it.

> > + * @dev: struct device to tie registration lifetime to
> > + * @adev: bridge auxiliary device to be registered
> > + *
> > + * Returns: zero on success or a negative errno
> 
> and "Return:" without the 's'.

This was a mistake however. Perhaps whoever applies this can drop it, or
I can send a v2.

Side note: Looks like there are more instances with an 's' than without
under driver/gpu/drm...

> This could however be done in a separate patch, as the file is already
> wrong in this regard.
> 
> Reviewed-by: Bjorn Andersson 

Thanks for reviewing.

Johan


Re: drm/msm: Second DisplayPort regression in 6.8-rc1

2024-02-21 Thread Johan Hovold
On Tue, Feb 20, 2024 at 01:19:54PM -0800, Abhinav Kumar wrote:
> On 2/19/2024 2:41 AM, Johan Hovold wrote:

> > It seems my initial suspicion that at least some of these regressions
> > were related to the runtime PM work was correct. The hard resets happens
> > when the DP controller is runtime suspended after being probed:

> > [   17.074925] bus: 'platform': __driver_probe_device: matched device 
> > aea.displayport-controller with driver msm-dp-display
> > [Starting Network Time Synchronization...
> > [   17.112000] msm-dp-display aea.displayport-controller: 
> > dp_display_probe - populate aux bus
> > [   17.125208] msm-dp-display aea.displayport-controller: 
> > dp_pm_runtime_resume
> >   Starting Record System Boot/Shutdown in UTMP...
> >   Starting Virtual Console Setup...
> > [  OK  ] Finished Load/Save Screen Backlight Brightness of 
> > backlight:backlight.
> > [   17.197909] msm-dp-display aea.displayport-controller: 
> > dp_pm_runtime_suspend
> > [   17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - 
> > Optional Info
> > Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
> > S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> > S - IMAGE_VARIANT_STRING=SocMakenaWP
> > S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
> > 
> >< machine is reset by hypervisor >
> > 
> > Presumably the reset happens when controller is being shut down while
> > still being used by the EFI framebuffer.
> 
> I am not sure if we can conclude like that. Even if we shut off the 
> controller when the framebuffer was still being fetched that should only 
> cause a blank screen and not a reset because we really don't trigger a 
> new register write / read while its fetching so as such there is no new 
> hardware access.

It specifically looks like the reset happens when shutting down the PHY,
that is, the call to dp_display_host_phy_exit(dp) in
dp_pm_runtime_suspend() never returns.

That seems like more than a coincidence to me.
 
> One thing I must accept is that there are two differences between 
> sc8280xp where we are hitting these resets and sc7180/sc7280 chromebooks 
> where we tested it more thoroughly without any such issues:
> 
> 1) with the chromebooks we have depthcharge and not the QC UEFI.
> 
> If we are suspecting a hand-off issue here, will it help if we try to 
> disable the display in EFI by using "fastboot oem select-display-panel 
> none" (assuming this is a fastboot enabled device) and see if you still 
> hit the reset issue?

No, we don't have fastboot.

But as I mentioned I still do see resets when I instrument the code to
not shut down the display, which could indicate more than one issue
here.

> 2) chromebooks used "internal_hpd" whereas the pmic_glink method used in 
> the sc8280xp.
> 
> I am still checking if there are any code paths in the eDP/DP driver 
> left exposed due to this difference with pm_runtime which can cause 
> this. I am wondering if some sort of drm tracing will help to narrow 
> down the reset point.
> 
> > In the cases where the machines survives boot, the controller is never
> > suspended.
> > 
> > When investigating this I've also seen intermittent:
> > 
> > [drm:dp_display_probe [msm]] *ERROR* device tree parsing failed
> 
> So this error I think is because in dp_parser_parse() ---> 
> dp_parser_ctrl_res(), we also have a devm_phy_get().
> 
> This can return -EDEFER if the phy driver has not yet probed.
> 
> I checked the other things inside dp_parser_parse(), others calls seem 
> to be purely DT parsing except this one. I think to avoid the confusion, 
> we should move devm_phy_get() outside of DT parsing into a separate call 
> or atleast add an error log inside devm_phy_get() failure below to 
> indicate that it deferred
> 
>  io->phy = devm_phy_get(>dev, "dp");
>  if (IS_ERR(io->phy))
>  return PTR_ERR(io->phy);
> 
> If my hypothesis is correct on this, then this error log (even though 
> misleading) should be harmless for this issue because if we hit 
> DRM_ERROR("device tree parsing failed\n"); we will skip the 
> devm_pm_runtime_enable().

Yeah, this seems to be the case as boot appears to recover from this, so
this may indeed be a probe deferral.

Probe deferrals should not be logged as errors however, so the fix is
not to add another error message but rather to suppress the current one
(e.g. using dev_err_probe()).

> > Has anyone given some thought to how the framebuffer handover is
> > supposed to work? It seems we're currently just relying on luck with
> > timing.

Any comments to this? It seems we should not be shutting down (runtime
suspend) the display during boot as can currently happen.

Johan


Re: [PATCH 3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free

2024-02-20 Thread Johan Hovold
On Tue, Feb 20, 2024 at 11:55:57AM +0100, Markus Elfring wrote:
> …
> > Specifically, the dp-hpd bridge is currently registered before all
> > resources have been acquired which means that it can also be
> > deregistered on probe deferrals.
> >
> > In the meantime there is a race window where the new aux bridge driver
> > (or PHY driver previously) may have looked up the dp-hpd bridge and
> > stored a (non-reference-counted) pointer to the bridge which is about to
> > be deallocated.
> …
> > +++ b/drivers/soc/qcom/pmic_glink_altmode.c
> …
> > @@ -454,7 +454,7 @@ static int pmic_glink_altmode_probe(struct 
> > auxiliary_device *adev,
> > alt_port->index = port;
> > INIT_WORK(_port->work, pmic_glink_altmode_worker);
> >
> > -   alt_port->bridge = drm_dp_hpd_bridge_register(dev, 
> > to_of_node(fwnode));
> > +   alt_port->bridge = devm_drm_dp_hpd_bridge_alloc(dev, 
> > to_of_node(fwnode));
> > if (IS_ERR(alt_port->bridge)) {
> > fwnode_handle_put(fwnode);
> > return PTR_ERR(alt_port->bridge);
> …
> 
> The function call “fwnode_handle_put(fwnode)” is used in multiple if branches.
> https://elixir.bootlin.com/linux/v6.8-rc5/source/drivers/soc/qcom/pmic_glink_altmode.c#L435
> 
> I suggest to add a jump target so that a bit of exception handling
> can be better reused at the end of this function implementation.

Markus, as people have told you repeatedly, just stop with these
comments. You're not helping, in fact, you are actively harmful to the
kernel community as you are wasting people's time.

Johan


Re: [PATCH] drm/msm: Wire up tlb ops

2024-02-20 Thread Johan Hovold
On Thu, Feb 15, 2024 at 07:28:53AM -0800, Rob Clark wrote:
> On Wed, Feb 14, 2024 at 11:34 PM Johan Hovold  wrote:
> >
> > On Tue, Feb 13, 2024 at 09:23:40AM -0800, Rob Clark wrote:
> > > From: Rob Clark 
> > >
> > > The brute force iommu_flush_iotlb_all() was good enough for unmap, but
> > > in some cases a map operation could require removing a table pte entry
> > > to replace with a block entry.  This also requires tlb invalidation.
> > > Missing this was resulting an obscure iova fault on what should be a
> > > valid buffer address.
> > >
> > > Thanks to Robin Murphy for helping me understand the cause of the fault.
> > >
> > > Cc: Robin Murphy 
> > > Fixes: b145c6e65eb0 ("drm/msm: Add support to create a local pagetable")
> >
> > Sounds like you're missing a
> >
> > Cc: sta...@vger.kernel.org
> >
> > here? Or is there some reason not to backport this fix (to 5.9 and later
> > kernels)?
> 
> No reason, I just expected the Fixes tag was sufficient

No, you should still mark patches intended for stable with an explicit
CC-stable tag (as per the documentation).

The fact that Sasha and his AI tries to catch fixes which the author and
maintainer failed to tag as a fallback should not be relied upon. It
also makes it harder for the stable team and others to determine what
the intention with a fix was.

Johan


Re: [PATCH 1/6] drm/bridge: aux-hpd: fix OF node leaks

2024-02-19 Thread Johan Hovold
On Mon, Feb 19, 2024 at 06:48:30PM +0100, Markus Elfring wrote:
> > The two device node references taken during allocation need to be
> > dropped when the auxiliary device is freed.
> …
> > +++ b/drivers/gpu/drm/bridge/aux-hpd-bridge.c
> …
> > @@ -74,6 +75,8 @@ struct device *drm_dp_hpd_bridge_register(struct device 
> > *parent,
> >
> > ret = auxiliary_device_init(adev);
> > if (ret) {
> > +   of_node_put(adev->dev.platform_data);
> > +   of_node_put(adev->dev.of_node);
> > ida_free(_aux_hpd_bridge_ida, adev->id);
> > kfree(adev);
> > return ERR_PTR(ret);
> 
> The last two statements are also used in a previous if branch.
> https://elixir.bootlin.com/linux/v6.8-rc5/source/drivers/gpu/drm/bridge/aux-hpd-bridge.c#L63
> 
> How do you think about to avoid such a bit of duplicate source code
> by adding a label here?

No, the current code is fine and what you are suggesting is in any case
unrelated to this fix.

If this function ever grows a third error path like that, I too would
consider it however.

Johan


Re: drm/msm: Second DisplayPort regression in 6.8-rc1

2024-02-19 Thread Johan Hovold
On Mon, Feb 19, 2024 at 11:41:41AM +0100, Johan Hovold wrote:

> It seems my initial suspicion that at least some of these regressions
> were related to the runtime PM work was correct. The hard resets happens
> when the DP controller is runtime suspended after being probed:
 
> [   17.074925] bus: 'platform': __driver_probe_device: matched device 
> aea.displayport-controller with driver msm-dp-display
> [   17.112000] msm-dp-display aea.displayport-controller: 
> dp_display_probe - populate aux bus
> [   17.125208] msm-dp-display aea.displayport-controller: 
> dp_pm_runtime_resume
> [   17.197909] msm-dp-display aea.displayport-controller: 
> dp_pm_runtime_suspend
> [   17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - 
> Optional Info
> Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
> S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> S - IMAGE_VARIANT_STRING=SocMakenaWP
> S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
> 
>   < machine is reset by hypervisor >
> 
> Presumably the reset happens when controller is being shut down while
> still being used by the EFI framebuffer.
> 
> In the cases where the machines survives boot, the controller is never
> suspended.
> 
> When investigating this I've also seen intermittent:
> 
>   [drm:dp_display_probe [msm]] *ERROR* device tree parsing failed

Note that there are further indications there may be more than one bug
here too.

I definitely see hard resets when dp_pm_runtime_suspend() is shutting
down the eDP PHY, but there are occasional resets also if I instrument
DP controller probe() to resume and then prevent the controller from
suspending until after a timeout (e.g. to be used as a temporary
workaround):

[   15.676495] bus: 'platform': __driver_probe_device: matched device 
aea.displayport-controller with driver msm-dp-display
[   15.769392] msm-dp-display aea.displayport-controller: dp_display_probe 
- populate aux bus
[   15.778808] msm-dp-display aea.displayport-controller: dp_display_probe 
- scheduling handover
[   15.789931] probe of aea.displayport-controller returned 0 after 91121 
usecs
[   15.790460] bus: 'dp-aux': __driver_probe_device: matched device 
aux-aea.displayport-controller with driver panel-simple-dp-aux
Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1

I'll wait for the maintainers and authors of this code to comment, but
it seems the runtime PM work is broken in multiple ways.

Johan


drm/msm: Second DisplayPort regression in 6.8-rc1

2024-02-19 Thread Johan Hovold
On Sat, Feb 17, 2024 at 04:14:58PM +0100, Johan Hovold wrote:
> On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
> > On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:

> Since Dmitry had trouble reproducing this issue I took a closer look at
> the DRM aux bridge series that Abhinav pointed and was able to track
> down the bridge regressions and come up with a reproducer. I just posted
> a series fixing this here:
> 
>   
> https://lore.kernel.org/lkml/20240217150228.5788-1-johan+lin...@kernel.org/
> 
> As I mentioned in the cover letter, I am still seeing intermittent hard
> resets around the time that the DRM subsystem is initialising, which
> suggests that we may be dealing with two separate DRM regressions here
> however.
> 
> If the hard resets are triggered by something like unclocked hardware,
> perhaps that bit could this be related to the runtime PM rework?

It seems my initial suspicion that at least some of these regressions
were related to the runtime PM work was correct. The hard resets happens
when the DP controller is runtime suspended after being probed:

[   16.748475] bus: 'platform': __driver_probe_device: matched device 
ae0.display-subsystem with driver msm-mdss
[   16.759444] msm-mdss ae0.display-subsystem: Adding to iommu group 21
[   16.795226] bus: 'platform': __driver_probe_device: matched device 
ae01000.display-controller with driver msm_dpu
[   16.807542] probe of ae01000.display-controller returned -517 after 3 usecs
[   16.821552] bus: 'platform': __driver_probe_device: matched device 
ae9.displayport-controller with driver msm-dp-display
[   16.837749] probe of ae9.displayport-controller returned -517 after 1 
usecs
[  OK  ] Listening on Load/Save RF Kill Swit[   16.854659] bus: 'platform': 
__dch Status /dev/rfkill Watch.
[   16.868458] probe of ae98000.displayport-controller returned -517 after 2 
usecs
[   16.880012] bus: 'platform': __driver_probe_device: matched device 
aea.displayport-controller with driver msm-dp-display
[   16.891856] probe of aea.displayport-controller returned -517 after 2 
usecs
[   16.903825] probe of ae0.display-subsystem returned 0 after 144497 usecs
[   16.911636] bus: 'platform': __driver_probe_device: matched device 
ae01000.display-controller with driver msm_dpu
[   16.942092] probe of ae01000.display-controller returned 0 after 19593 usecs
 Starting Load/Save Screen Backligh…rightness[   16.959146] bus: 
'platform': _ of backlight:backlight...
[   16.995355] msm-dp-display ae9.displayport-controller: dp_display_probe 
- probe tail
[   17.004032] probe of ae9.displayport-controller returned 0 after 30225 
usecs
[   17.012308] bus: 'platform': __driver_probe_device: matched device 
ae98000.displayport-controller with driver msm-dp-display
[   17.050193] msm-dp-display ae98000.displayport-controller: dp_display_probe 
- probe tail
 Starting Network Name Resolution...
[   17.058925] probe of ae98000.displayport-controller returned 0 after 34774 
usecs
[   17.074925] bus: 'platform': __driver_probe_device: matched device 
aea.displayport-controller with driver msm-dp-display
[Starting Network Time Synchronization...
[   17.112000] msm-dp-display aea.displayport-controller: dp_display_probe 
- populate aux bus
[   17.125208] msm-dp-display aea.displayport-controller: 
dp_pm_runtime_resume
 Starting Record System Boot/Shutdown in UTMP...
 Starting Virtual Console Setup...
[  OK  ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
[   17.197909] msm-dp-display aea.displayport-controller: 
dp_pm_runtime_suspend
[   17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - 
Optional Info
Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
S - IMAGE_VARIANT_STRING=SocMakenaWP
S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92

  < machine is reset by hypervisor >

Presumably the reset happens when controller is being shut down while
still being used by the EFI framebuffer.

In the cases where the machines survives boot, the controller is never
suspended.

When investigating this I've also seen intermittent:

[drm:dp_display_probe [msm]] *ERROR* device tree parsing failed

which also appears to be related to the runtime PM rework:


https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khs...@quicinc.com/

I believe this is enough evidence to conclude that this second
regression is introduced by commit 5814b8bf086a ("drm/msm/dp:
incorporate pm_runtime framework into DP driver"):

#regzbot introduced: 5814b8bf086a

Has anyone given some thought to how the framebuffer handover is
supposed to work? It seems we're currently just relying on luck with
timing.

Johan


Re: drm/msm: DisplayPort regressions in 6.8-rc1

2024-02-17 Thread Johan Hovold
On Tue, Feb 13, 2024 at 12:42:17PM +0100, Johan Hovold wrote:

> Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
> not always show up on boot.
> 
> The logs indicate problems with the runtime PM and eDP rework that went
> into 6.8-rc1:
> 
>   [6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach 
> bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16

> and this can also manifest itself as a NULL-pointer dereference:
> 
>   [7.339447] Unable to handle kernel NULL pointer dereference at 
> virtual address 
>   
>   [7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm]

#regzbot ^introduced: 2bcca96abfbf

It looks like it may have been possible to hit this also before commit
2bcca96abfbf ("soc: qcom: pmic-glink: switch to DRM_AUX_HPD_BRIDGE") and
the transparent bridge rework in 6.8-rc1 even if that has not yet been
confirmed.

The above is what made this trigger since 6.8-rc1 however.

Johan


Re: drm/msm: DisplayPort regressions in 6.8-rc1

2024-02-17 Thread Johan Hovold
On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
> On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:
> 
> > I do agree that pm runtime eDP driver got merged that time but I think 
> > the issue is either a combination of that along with DRM aux bridge 
> > https://patchwork.freedesktop.org/series/122584/ OR just the latter as 
> > even that went in around the same time.
> 
> Yes, indeed there was a lot of changes that went into the MSM drm driver
> in 6.8-rc1 and since I have not tried to debug this myself I can't say
> for sure which change or changes that triggered this regression (or
> possibly regressions).
> 
> The fact that the USB-C/DP PHY appears to be involved
> (/soc@0/phy@88eb000) could indeed point to the series you mentioned.
> 
> > Thats why perhaps this issue was not seen with the chromebooks we tested 
> > on as they do not use pmic_glink (aux bridge).
> > 
> > So we will need to debug this on sc8280xp specifically or an equivalent 
> > device which uses aux bridge.
> 
> I've hit the NULL-pointer deference three times now in the last few days
> on the sc8280xp CRD. But since it doesn't trigger on every boot it seems
> you need to go back to the series that could potentially have caused
> this regression and review them again. There's clearly something quite
> broken here.

Since Dmitry had trouble reproducing this issue I took a closer look at
the DRM aux bridge series that Abhinav pointed and was able to track
down the bridge regressions and come up with a reproducer. I just posted
a series fixing this here:


https://lore.kernel.org/lkml/20240217150228.5788-1-johan+lin...@kernel.org/

As I mentioned in the cover letter, I am still seeing intermittent hard
resets around the time that the DRM subsystem is initialising, which
suggests that we may be dealing with two separate DRM regressions here
however.

If the hard resets are triggered by something like unclocked hardware,
perhaps that bit could this be related to the runtime PM rework?

Johan


[PATCH 2/6] drm/bridge: aux-hpd: separate allocation and registration

2024-02-17 Thread Johan Hovold
Combining allocation and registration is an anti-pattern that should be
avoided. Add two new functions for allocating and registering an dp-hpd
bridge with a proper 'devm' prefix so that it is clear that these are
device managed interfaces.

devm_drm_dp_hpd_bridge_alloc()
devm_drm_dp_hpd_bridge_add()

The new interface will be used to fix a use-after-free bug in the
Qualcomm PMIC GLINK driver and may prevent similar issues from being
introduced elsewhere.

The existing drm_dp_hpd_bridge_register() is reimplemented using the
above and left in place for now.

Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/bridge/aux-hpd-bridge.c | 67 +++--
 include/drm/bridge/aux-bridge.h | 15 ++
 2 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/bridge/aux-hpd-bridge.c 
b/drivers/gpu/drm/bridge/aux-hpd-bridge.c
index 9e71daf95bde..6886db2d9e00 100644
--- a/drivers/gpu/drm/bridge/aux-hpd-bridge.c
+++ b/drivers/gpu/drm/bridge/aux-hpd-bridge.c
@@ -30,16 +30,13 @@ static void drm_aux_hpd_bridge_release(struct device *dev)
kfree(adev);
 }
 
-static void drm_aux_hpd_bridge_unregister_adev(void *_adev)
+static void drm_aux_hpd_bridge_free_adev(void *_adev)
 {
-   struct auxiliary_device *adev = _adev;
-
-   auxiliary_device_delete(adev);
-   auxiliary_device_uninit(adev);
+   auxiliary_device_uninit(_adev);
 }
 
 /**
- * drm_dp_hpd_bridge_register - Create a simple HPD DisplayPort bridge
+ * devm_drm_dp_hpd_bridge_alloc - allocate a HPD DisplayPort bridge
  * @parent: device instance providing this bridge
  * @np: device node pointer corresponding to this bridge instance
  *
@@ -47,11 +44,9 @@ static void drm_aux_hpd_bridge_unregister_adev(void *_adev)
  * DRM_MODE_CONNECTOR_DisplayPort, which terminates the bridge chain and is
  * able to send the HPD events.
  *
- * Return: device instance that will handle created bridge or an error code
- * encoded into the pointer.
+ * Return: bridge auxiliary device pointer or an error pointer
  */
-struct device *drm_dp_hpd_bridge_register(struct device *parent,
- struct device_node *np)
+struct auxiliary_device *devm_drm_dp_hpd_bridge_alloc(struct device *parent, 
struct device_node *np)
 {
struct auxiliary_device *adev;
int ret;
@@ -82,13 +77,55 @@ struct device *drm_dp_hpd_bridge_register(struct device 
*parent,
return ERR_PTR(ret);
}
 
-   ret = auxiliary_device_add(adev);
-   if (ret) {
-   auxiliary_device_uninit(adev);
+   ret = devm_add_action_or_reset(parent, drm_aux_hpd_bridge_free_adev, 
adev);
+   if (ret)
return ERR_PTR(ret);
-   }
 
-   ret = devm_add_action_or_reset(parent, 
drm_aux_hpd_bridge_unregister_adev, adev);
+   return adev;
+}
+EXPORT_SYMBOL_GPL(devm_drm_dp_hpd_bridge_alloc);
+
+static void drm_aux_hpd_bridge_del_adev(void *_adev)
+{
+   auxiliary_device_delete(_adev);
+}
+
+/**
+ * devm_drm_dp_hpd_bridge_add - register a HDP DisplayPort bridge
+ * @dev: struct device to tie registration lifetime to
+ * @adev: bridge auxiliary device to be registered
+ *
+ * Returns: zero on success or a negative errno
+ */
+int devm_drm_dp_hpd_bridge_add(struct device *dev, struct auxiliary_device 
*adev)
+{
+   int ret;
+
+   ret = auxiliary_device_add(adev);
+   if (ret)
+   return ret;
+
+   return devm_add_action_or_reset(dev, drm_aux_hpd_bridge_del_adev, adev);
+}
+EXPORT_SYMBOL_GPL(devm_drm_dp_hpd_bridge_add);
+
+/**
+ * drm_dp_hpd_bridge_register - allocate and register a HDP DisplayPort bridge
+ * @parent: device instance providing this bridge
+ * @np: device node pointer corresponding to this bridge instance
+ *
+ * Return: device instance that will handle created bridge or an error pointer
+ */
+struct device *drm_dp_hpd_bridge_register(struct device *parent, struct 
device_node *np)
+{
+   struct auxiliary_device *adev;
+   int ret;
+
+   adev = devm_drm_dp_hpd_bridge_alloc(parent, np);
+   if (IS_ERR(adev))
+   return ERR_CAST(adev);
+
+   ret = devm_drm_dp_hpd_bridge_add(parent, adev);
if (ret)
return ERR_PTR(ret);
 
diff --git a/include/drm/bridge/aux-bridge.h b/include/drm/bridge/aux-bridge.h
index c4c423e97f06..4453906105ca 100644
--- a/include/drm/bridge/aux-bridge.h
+++ b/include/drm/bridge/aux-bridge.h
@@ -9,6 +9,8 @@
 
 #include 
 
+struct auxiliary_device;
+
 #if IS_ENABLED(CONFIG_DRM_AUX_BRIDGE)
 int drm_aux_bridge_register(struct device *parent);
 #else
@@ -19,10 +21,23 @@ static inline int drm_aux_bridge_register(struct device 
*parent)
 #endif
 
 #if IS_ENABLED(CONFIG_DRM_AUX_HPD_BRIDGE)
+struct auxiliary_device *devm_drm_dp_hpd_bridge_alloc(struct device *parent, 
struct device_node *np);
+int devm_drm_dp_hpd_bridge_add(struct device *dev, struct auxiliary_device 
*adev);
 struct device *drm_dp_hpd_bridge_register(struct device

[PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free

2024-02-17 Thread Johan Hovold
Starting with 6.8-rc1 the internal display sometimes fails to come up on
machines like the Lenovo ThinkPad X13s and the logs indicate that this
is due to a regression in the DRM subsystem [1].

This series fixes a race in the pmic_glink_altmode driver which was
exposed / triggered by the transparent DRM bridges rework that went into
6.8-rc1 and that manifested itself as a bridge failing to attach and
sometimes triggering a NULL-pointer dereference.

The intermittent hard resets that have also been reported since 6.8-rc1
unfortunately still remains and suggests that we are dealing with two
separate regressions. There is some indication that also the hard resets
(e.g. due to register accesses to unclocked hardware) are also due to
changes in the DRM subsystem as it happens around the time that the eDP
panel and display controller would be initialised during boot (the
runtime PM rework?). This remains to be verified, however.

Included is also a fix for a related OF node reference leak in the
aux-hpd driver found through inspection when reworking the driver.

The use-after-free bug is triggered by a probe deferral and highlighted
some further bugs in the involved drivers, which were registering child
devices before deferring probe. This behaviour is not correct and can
both trigger probe deferral loops and potentially also further issues
with the DRM bridge implementation.

This series can either go through the Qualcomm SoC tree (pmic_glink) or
the DRM tree. The PHY patches do not depend on the rest of the series
and could possibly be merged separately through the PHY tree.

Whichever gets this to mainline the fastest.

Johan


[1] https://lore.kernel.org/lkml/zctvmlk4ztwcp...@hovoldconsulting.com/


Johan Hovold (5):
  drm/bridge: aux-hpd: fix OF node leaks
  drm/bridge: aux-hpd: separate allocation and registration
  soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
  phy: qcom-qmp-combo: fix drm bridge registration
  phy: qcom-qmp-combo: fix type-c switch registration

Rob Clark (1):
  soc: qcom: pmic_glink: Fix boot when QRTR=m

 drivers/gpu/drm/bridge/aux-hpd-bridge.c   | 70 ++-
 drivers/phy/qualcomm/phy-qcom-qmp-combo.c | 16 +++---
 drivers/soc/qcom/pmic_glink.c | 21 +++
 drivers/soc/qcom/pmic_glink_altmode.c | 16 +-
 include/drm/bridge/aux-bridge.h   | 15 +
 5 files changed, 102 insertions(+), 36 deletions(-)

-- 
2.43.0



[PATCH 5/6] phy: qcom-qmp-combo: fix drm bridge registration

2024-02-17 Thread Johan Hovold
Due to a long-standing issue in driver core, drivers may not probe defer
after having registered child devices to avoid triggering a probe
deferral loop (see fbc35b45f9f6 ("Add documentation on meaning of
-EPROBE_DEFER")).

This could potentially also trigger a bug in the DRM bridge
implementation which does not expect bridges to go away even if device
links may avoid triggering this (when enabled).

Move registration of the DRM aux bridge to after looking up clocks and
other resources.

Note that PHY creation can in theory also trigger a probe deferral when
a 'phy' supply is used. This does not seem to affect the QMP PHY driver
but the PHY subsystem should be reworked to address this (i.e. by
separating initialisation and registration of the PHY).

Fixes: 35921910bbd0 ("phy: qcom: qmp-combo: switch to DRM_AUX_BRIDGE")
Fixes: 1904c3f578dc ("phy: qcom-qmp-combo: Introduce drm_bridge")
Cc: sta...@vger.kernel.org  # 6.5
Cc: Bjorn Andersson 
Cc: Dmitry Baryshkov 
Signed-off-by: Johan Hovold 
---
 drivers/phy/qualcomm/phy-qcom-qmp-combo.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c 
b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c
index 1ad10110dd25..e19d6a084f10 100644
--- a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c
+++ b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c
@@ -3566,10 +3566,6 @@ static int qmp_combo_probe(struct platform_device *pdev)
if (ret)
return ret;
 
-   ret = drm_aux_bridge_register(dev);
-   if (ret)
-   return ret;
-
/* Check for legacy binding with child nodes. */
usb_np = of_get_child_by_name(dev->of_node, "usb3-phy");
if (usb_np) {
@@ -3589,6 +3585,10 @@ static int qmp_combo_probe(struct platform_device *pdev)
if (ret)
goto err_node_put;
 
+   ret = drm_aux_bridge_register(dev);
+   if (ret)
+   goto err_node_put;
+
pm_runtime_set_active(dev);
ret = devm_pm_runtime_enable(dev);
if (ret)
-- 
2.43.0



[PATCH 4/6] soc: qcom: pmic_glink: Fix boot when QRTR=m

2024-02-17 Thread Johan Hovold
From: Rob Clark 

We need to bail out before adding/removing devices if we are going to
-EPROBE_DEFER. Otherwise boot can get stuck in a probe deferral loop due
to a long-standing issue in driver core (see fbc35b45f9f6 ("Add
documentation on meaning of -EPROBE_DEFER")).

Deregistering the altmode child device can potentially also trigger bugs
in the DRM bridge implementation, which does not expect bridges to go
away.

Suggested-by: Dmitry Baryshkov 
Signed-off-by: Rob Clark 
Link: https://lore.kernel.org/r/20231213210644.8702-1-robdcl...@gmail.com
[ johan: rebase on 6.8-rc4, amend commit message and mention DRM ]
Fixes: 58ef4ece1e41 ("soc: qcom: pmic_glink: Introduce base PMIC GLINK driver")
Cc: sta...@vger.kernel.org  # 6.3
Cc: Bjorn Andersson 
Signed-off-by: Johan Hovold 
---
 drivers/soc/qcom/pmic_glink.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
index f4bfd24386f1..f913e9bd57ed 100644
--- a/drivers/soc/qcom/pmic_glink.c
+++ b/drivers/soc/qcom/pmic_glink.c
@@ -265,10 +265,17 @@ static int pmic_glink_probe(struct platform_device *pdev)
 
pg->client_mask = *match_data;
 
+   pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg);
+   if (IS_ERR(pg->pdr)) {
+   ret = dev_err_probe(>dev, PTR_ERR(pg->pdr),
+   "failed to initialize pdr\n");
+   return ret;
+   }
+
if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI)) {
ret = pmic_glink_add_aux_device(pg, >ucsi_aux, "ucsi");
if (ret)
-   return ret;
+   goto out_release_pdr_handle;
}
if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_ALTMODE)) {
ret = pmic_glink_add_aux_device(pg, >altmode_aux, 
"altmode");
@@ -281,17 +288,11 @@ static int pmic_glink_probe(struct platform_device *pdev)
goto out_release_altmode_aux;
}
 
-   pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg);
-   if (IS_ERR(pg->pdr)) {
-   ret = dev_err_probe(>dev, PTR_ERR(pg->pdr), "failed to 
initialize pdr\n");
-   goto out_release_aux_devices;
-   }
-
service = pdr_add_lookup(pg->pdr, "tms/servreg", "msm/adsp/charger_pd");
if (IS_ERR(service)) {
ret = dev_err_probe(>dev, PTR_ERR(service),
"failed adding pdr lookup for 
charger_pd\n");
-   goto out_release_pdr_handle;
+   goto out_release_aux_devices;
}
 
mutex_lock(&__pmic_glink_lock);
@@ -300,8 +301,6 @@ static int pmic_glink_probe(struct platform_device *pdev)
 
return 0;
 
-out_release_pdr_handle:
-   pdr_handle_release(pg->pdr);
 out_release_aux_devices:
if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_BATT))
pmic_glink_del_aux_device(pg, >ps_aux);
@@ -311,6 +310,8 @@ static int pmic_glink_probe(struct platform_device *pdev)
 out_release_ucsi_aux:
if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI))
pmic_glink_del_aux_device(pg, >ucsi_aux);
+out_release_pdr_handle:
+   pdr_handle_release(pg->pdr);
 
return ret;
 }
-- 
2.43.0



[PATCH 3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free

2024-02-17 Thread Johan Hovold
A recent DRM series purporting to simplify support for "transparent
bridges" and handling of probe deferrals ironically exposed a
use-after-free issue on pmic_glink_altmode probe deferral.

This has manifested itself as the display subsystem occasionally failing
to initialise and NULL-pointer dereferences during boot of machines like
the Lenovo ThinkPad X13s.

Specifically, the dp-hpd bridge is currently registered before all
resources have been acquired which means that it can also be
deregistered on probe deferrals.

In the meantime there is a race window where the new aux bridge driver
(or PHY driver previously) may have looked up the dp-hpd bridge and
stored a (non-reference-counted) pointer to the bridge which is about to
be deallocated.

When the display controller is later initialised, this triggers a
use-after-free when attaching the bridges:

dp -> aux -> dp-hpd (freed)

which may, for example, result in the freed bridge failing to attach:

[drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge 
/soc@0/phy@88eb000 to encoder TMDS-31: -16

or a NULL-pointer dereference:

Unable to handle kernel NULL pointer dereference at virtual address 

...
Call trace:
  drm_bridge_attach+0x70/0x1a8 [drm]
  drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
  drm_bridge_attach+0x80/0x1a8 [drm]
  dp_bridge_init+0xa8/0x15c [msm]
  msm_dp_modeset_init+0x28/0xc4 [msm]

The DRM bridge implementation is clearly fragile and implicitly built on
the assumption that bridges may never go away. In this case, the fix is
to move the bridge registration in the pmic_glink_altmode driver to
after all resources have been looked up.

Incidentally, with the new dp-hpd bridge implementation, which registers
child devices, this is also a requirement due to a long-standing issue
in driver core that can otherwise lead to a probe deferral loop (see
fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")).

Fixes: 080b4e24852b ("soc: qcom: pmic_glink: Introduce altmode support")
Fixes: 2bcca96abfbf ("soc: qcom: pmic-glink: switch to DRM_AUX_HPD_BRIDGE")
Cc: sta...@vger.kernel.org  # 6.3
Cc: Bjorn Andersson 
Cc: Dmitry Baryshkov 
Signed-off-by: Johan Hovold 
---
 drivers/soc/qcom/pmic_glink_altmode.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/soc/qcom/pmic_glink_altmode.c 
b/drivers/soc/qcom/pmic_glink_altmode.c
index 5fcd0fdd2faa..b3808fc24c69 100644
--- a/drivers/soc/qcom/pmic_glink_altmode.c
+++ b/drivers/soc/qcom/pmic_glink_altmode.c
@@ -76,7 +76,7 @@ struct pmic_glink_altmode_port {
 
struct work_struct work;
 
-   struct device *bridge;
+   struct auxiliary_device *bridge;
 
enum typec_orientation orientation;
u16 svid;
@@ -230,7 +230,7 @@ static void pmic_glink_altmode_worker(struct work_struct 
*work)
else
pmic_glink_altmode_enable_usb(altmode, alt_port);
 
-   drm_aux_hpd_bridge_notify(alt_port->bridge,
+   drm_aux_hpd_bridge_notify(_port->bridge->dev,
  alt_port->hpd_state ?
  connector_status_connected :
  connector_status_disconnected);
@@ -454,7 +454,7 @@ static int pmic_glink_altmode_probe(struct auxiliary_device 
*adev,
alt_port->index = port;
INIT_WORK(_port->work, pmic_glink_altmode_worker);
 
-   alt_port->bridge = drm_dp_hpd_bridge_register(dev, 
to_of_node(fwnode));
+   alt_port->bridge = devm_drm_dp_hpd_bridge_alloc(dev, 
to_of_node(fwnode));
if (IS_ERR(alt_port->bridge)) {
fwnode_handle_put(fwnode);
return PTR_ERR(alt_port->bridge);
@@ -510,6 +510,16 @@ static int pmic_glink_altmode_probe(struct 
auxiliary_device *adev,
}
}
 
+   for (port = 0; port < ARRAY_SIZE(altmode->ports); port++) {
+   alt_port = >ports[port];
+   if (!alt_port->bridge)
+   continue;
+
+   ret = devm_drm_dp_hpd_bridge_add(dev, alt_port->bridge);
+   if (ret)
+   return ret;
+   }
+
altmode->client = devm_pmic_glink_register_client(dev,
  altmode->owner_id,
  
pmic_glink_altmode_callback,
-- 
2.43.0



[PATCH 1/6] drm/bridge: aux-hpd: fix OF node leaks

2024-02-17 Thread Johan Hovold
The two device node references taken during allocation need to be
dropped when the auxiliary device is freed.

Fixes: 6914968a0b52 ("drm/bridge: properly refcount DT nodes in aux bridge 
drivers")
Cc: Dmitry Baryshkov 
Cc: Neil Armstrong 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/bridge/aux-hpd-bridge.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/bridge/aux-hpd-bridge.c 
b/drivers/gpu/drm/bridge/aux-hpd-bridge.c
index bb55f697a181..9e71daf95bde 100644
--- a/drivers/gpu/drm/bridge/aux-hpd-bridge.c
+++ b/drivers/gpu/drm/bridge/aux-hpd-bridge.c
@@ -25,6 +25,7 @@ static void drm_aux_hpd_bridge_release(struct device *dev)
ida_free(_aux_hpd_bridge_ida, adev->id);
 
of_node_put(adev->dev.platform_data);
+   of_node_put(adev->dev.of_node);
 
kfree(adev);
 }
@@ -74,6 +75,8 @@ struct device *drm_dp_hpd_bridge_register(struct device 
*parent,
 
ret = auxiliary_device_init(adev);
if (ret) {
+   of_node_put(adev->dev.platform_data);
+   of_node_put(adev->dev.of_node);
ida_free(_aux_hpd_bridge_ida, adev->id);
kfree(adev);
return ERR_PTR(ret);
-- 
2.43.0



[PATCH 6/6] phy: qcom-qmp-combo: fix type-c switch registration

2024-02-17 Thread Johan Hovold
Due to a long-standing issue in driver core, drivers may not probe defer
after having registered child devices to avoid triggering a probe
deferral loop (see fbc35b45f9f6 ("Add documentation on meaning of
-EPROBE_DEFER")).

Move registration of the typec switch to after looking up clocks and
other resources.

Note that PHY creation can in theory also trigger a probe deferral when
a 'phy' supply is used. This does not seem to affect the QMP PHY driver
but the PHY subsystem should be reworked to address this (i.e. by
separating initialisation and registration of the PHY).

Fixes: 2851117f8f42 ("phy: qcom-qmp-combo: Introduce orientation switching")
Cc: sta...@vger.kernel.org  # 6.5
Cc: Bjorn Andersson 
Signed-off-by: Johan Hovold 
---
 drivers/phy/qualcomm/phy-qcom-qmp-combo.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c 
b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c
index e19d6a084f10..17c4ad7553a5 100644
--- a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c
+++ b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c
@@ -3562,10 +3562,6 @@ static int qmp_combo_probe(struct platform_device *pdev)
if (ret)
return ret;
 
-   ret = qmp_combo_typec_switch_register(qmp);
-   if (ret)
-   return ret;
-
/* Check for legacy binding with child nodes. */
usb_np = of_get_child_by_name(dev->of_node, "usb3-phy");
if (usb_np) {
@@ -3585,6 +3581,10 @@ static int qmp_combo_probe(struct platform_device *pdev)
if (ret)
goto err_node_put;
 
+   ret = qmp_combo_typec_switch_register(qmp);
+   if (ret)
+   goto err_node_put;
+
ret = drm_aux_bridge_register(dev);
if (ret)
goto err_node_put;
-- 
2.43.0



Re: [PATCH] drm/msm: Wire up tlb ops

2024-02-14 Thread Johan Hovold
On Tue, Feb 13, 2024 at 09:23:40AM -0800, Rob Clark wrote:
> From: Rob Clark 
> 
> The brute force iommu_flush_iotlb_all() was good enough for unmap, but
> in some cases a map operation could require removing a table pte entry
> to replace with a block entry.  This also requires tlb invalidation.
> Missing this was resulting an obscure iova fault on what should be a
> valid buffer address.
> 
> Thanks to Robin Murphy for helping me understand the cause of the fault.
> 
> Cc: Robin Murphy 
> Fixes: b145c6e65eb0 ("drm/msm: Add support to create a local pagetable")

Sounds like you're missing a

Cc: sta...@vger.kernel.org

here? Or is there some reason not to backport this fix (to 5.9 and later
kernels)?

> Signed-off-by: Rob Clark 

Johan


Re: drm/msm: DisplayPort regressions in 6.8-rc1

2024-02-14 Thread Johan Hovold
On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:

> I do agree that pm runtime eDP driver got merged that time but I think 
> the issue is either a combination of that along with DRM aux bridge 
> https://patchwork.freedesktop.org/series/122584/ OR just the latter as 
> even that went in around the same time.

Yes, indeed there was a lot of changes that went into the MSM drm driver
in 6.8-rc1 and since I have not tried to debug this myself I can't say
for sure which change or changes that triggered this regression (or
possibly regressions).

The fact that the USB-C/DP PHY appears to be involved
(/soc@0/phy@88eb000) could indeed point to the series you mentioned.

> Thats why perhaps this issue was not seen with the chromebooks we tested 
> on as they do not use pmic_glink (aux bridge).
> 
> So we will need to debug this on sc8280xp specifically or an equivalent 
> device which uses aux bridge.

I've hit the NULL-pointer deference three times now in the last few days
on the sc8280xp CRD. But since it doesn't trigger on every boot it seems
you need to go back to the series that could potentially have caused
this regression and review them again. There's clearly something quite
broken here.

> On 2/13/2024 3:42 AM, Johan Hovold wrote:

> > Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
> > not always show up on boot.

> > [6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach 
> > bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16

> > and this can also manifest itself as a NULL-pointer dereference:
> > 
> > [7.339447] Unable to handle kernel NULL pointer dereference at 
> > virtual address 
> > 
> > [7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm]
> > [7.686415] lr : drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
> > 
> > [7.769039] Call trace:
> > [7.771564]  drm_bridge_attach+0x70/0x1a8 [drm]
> > [7.776234]  drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
> > [7.781782]  drm_bridge_attach+0x80/0x1a8 [drm]
> > [7.786454]  dp_bridge_init+0xa8/0x15c [msm]
> > [7.790856]  msm_dp_modeset_init+0x28/0xc4 [msm]
> > [7.795617]  _dpu_kms_drm_obj_init+0x19c/0x680 [msm]
> > [7.800731]  dpu_kms_hw_init+0x348/0x4c4 [msm]
> > [7.805306]  msm_drm_kms_init+0x84/0x324 [msm]
> > [7.809891]  msm_drm_bind+0x1d8/0x3a8 [msm]
> > [7.814196]  try_to_bring_up_aggregate_device+0x1f0/0x2f8
> > [7.819747]  __component_add+0xa4/0x18c
> > [7.823703]  component_add+0x14/0x20
> > [7.827389]  dp_display_probe+0x47c/0x568 [msm]
> > [7.832052]  platform_probe+0x68/0xd8
> > 
> > Users have also reported random crashes at boot since 6.8-rc1, and I've
> > been able to trigger hard crashes twice when testing an external display
> > (USB-C/DP), which may also be related to the DP regressions.

Johan


drm/msm: DisplayPort regressions in 6.8-rc1

2024-02-13 Thread Johan Hovold
Hi,

Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
not always show up on boot.

The logs indicate problems with the runtime PM and eDP rework that went
into 6.8-rc1:

[6.006236] Console: switching to colour dummy device 80x25
[6.007542] [drm:dpu_kms_hw_init:1048] dpu hardware 
revision:0x8000
[6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach 
bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16
[6.007934] [drm:dp_bridge_init [msm]] *ERROR* failed to attach 
panel bridge: -16
[6.007983] msm_dpu ae01000.display-controller: 
[drm:msm_dp_modeset_init [msm]] *ERROR* failed to create dp bridge: -16
[6.008030] [drm:_dpu_kms_initialize_displayport:588] [dpu 
error]modeset_init failed for DP, rc = -16
[6.008050] [drm:_dpu_kms_setup_displays:681] [dpu 
error]initialize_DP failed, rc = -16
[6.008068] [drm:dpu_kms_hw_init:1153] [dpu error]modeset init 
failed: -16
[6.008388] msm_dpu ae01000.display-controller: 
[drm:msm_drm_kms_init [msm]] *ERROR* kms hw init failed: -16

and this can also manifest itself as a NULL-pointer dereference:

[7.339447] Unable to handle kernel NULL pointer dereference at 
virtual address 

[7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm]
[7.686415] lr : drm_aux_bridge_attach+0x24/0x38 [aux_bridge]

[7.769039] Call trace:
[7.771564]  drm_bridge_attach+0x70/0x1a8 [drm]
[7.776234]  drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
[7.781782]  drm_bridge_attach+0x80/0x1a8 [drm]
[7.786454]  dp_bridge_init+0xa8/0x15c [msm]
[7.790856]  msm_dp_modeset_init+0x28/0xc4 [msm]
[7.795617]  _dpu_kms_drm_obj_init+0x19c/0x680 [msm]
[7.800731]  dpu_kms_hw_init+0x348/0x4c4 [msm]
[7.805306]  msm_drm_kms_init+0x84/0x324 [msm]
[7.809891]  msm_drm_bind+0x1d8/0x3a8 [msm]
[7.814196]  try_to_bring_up_aggregate_device+0x1f0/0x2f8
[7.819747]  __component_add+0xa4/0x18c
[7.823703]  component_add+0x14/0x20
[7.827389]  dp_display_probe+0x47c/0x568 [msm]
[7.832052]  platform_probe+0x68/0xd8

Users have also reported random crashes at boot since 6.8-rc1, and I've
been able to trigger hard crashes twice when testing an external display
(USB-C/DP), which may also be related to the DP regressions.

I've opened an issue here:

https://gitlab.freedesktop.org/drm/msm/-/issues/51

but I also want Thorsten's help to track this so that it gets fixed
before 6.8 is released.

#regzbot introduced: v6.7..v6.8-rc1

The following series is likely the culprit:


https://lore.kernel.org/all/1701472789-25951-1-git-send-email-quic_khs...@quicinc.com/

Johan


Re: [PATCH] drm/msm/dpu: Add missing safe_lut_tbl in sc8180x catalog

2023-12-01 Thread Johan Hovold
On Thu, Nov 30, 2023 at 04:35:01PM -0800, Bjorn Andersson wrote:
> Similar to SC8280XP, the misconfigured SAFE logic causes rather
> significant delays in __arm_smmu_tlb_sync(), resulting in poor
> performance for things such as USB.
> 
> Introduce appropriate SAFE values for SC8180X to correct this.
> 
> Fixes: f3af2d6ee9ab ("drm/msm/dpu: Add SC8180x to hw catalog")

Missing CC stable tag?

> Signed-off-by: Bjorn Andersson 

Johan


Re: [PATCH v2 1/2] drm/msm/dp: don't touch DP subconnector property in eDP case

2023-11-15 Thread Johan Hovold
On Wed, Oct 25, 2023 at 12:23:09PM +0300, Dmitry Baryshkov wrote:
> From: Abel Vesa 
> 
> In case of the eDP connection there is no subconnetor and as such no
> subconnector property. Put drm_dp_set_subconnector_property() calls
> under the !is_edp condition.
> 
> Fixes: bfcc3d8f94f4 ("drm/msm/dp: support setting the DP subconnector type")
> Signed-off-by: Abel Vesa 
> Signed-off-by: Dmitry Baryshkov 

Reviewed-by: Johan Hovold 
Tested-by: Johan Hovold 


Re: [PATCH v2 2/2] drm/msm/dp: attach the DP subconnector property

2023-11-15 Thread Johan Hovold
On Wed, Oct 25, 2023 at 12:23:10PM +0300, Dmitry Baryshkov wrote:
> While developing and testing the commit bfcc3d8f94f4 ("drm/msm/dp:
> support setting the DP subconnector type") I had the patch [1] in my
> tree. I haven't noticed that it was a dependency for the commit in
> question. Mea culpa.

This also broke boot on the Lenovo ThinkPad X13s.

Would be nice to get this fixed ASAP so that further people don't have
to debug this known regression.
 
> Since the patch has not landed yet (and even was not reviewed)
> and since one of the bridges erroneously uses USB connector type instead
> of DP, attach the property directly from the MSM DP driver.
> 
> This fixes the following oops on DP HPD event:
> 
>  drm_object_property_set_value (drivers/gpu/drm/drm_mode_object.c:288)
>  dp_display_process_hpd_high (drivers/gpu/drm/msm/dp/dp_display.c:402)
>  dp_hpd_plug_handle.isra.0 (drivers/gpu/drm/msm/dp/dp_display.c:604)
>  hpd_event_thread (drivers/gpu/drm/msm/dp/dp_display.c:1110)
>  kthread (kernel/kthread.c:388)
>  ret_from_fork (arch/arm64/kernel/entry.S:858)

This only says where the oops happened, it doesn't necessarily in itself
indicate an oops at all or that in this case it's a NULL pointer
dereference.

On the X13s I'm seeing the NULL deref in a different path during boot,
and when this happens after a deferred probe (due to the panel lookup
mess) it hangs the machine, which makes it a bit of a pain to debug:

   Unable to handle kernel NULL pointer dereference at virtual address 
0060
   ...
   CPU: 4 PID: 57 Comm: kworker/u16:1 Not tainted 6.7.0-rc1 #4
   Hardware name: Qualcomm QRD, BIOS 6.0.220110.BOOT.MXF.1.1-00470-MAKENA-1 
01/10/2022
   ...
   Call trace:
drm_object_property_set_value+0x0/0x88 [drm]
dp_display_process_hpd_high+0xa0/0x14c [msm]
dp_hpd_plug_handle.constprop.0.isra.0+0x90/0x110 [msm]
dp_bridge_atomic_enable+0x184/0x21c [msm]
edp_bridge_atomic_enable+0x60/0x94 [msm]
drm_atomic_bridge_chain_enable+0x54/0xc8 [drm]
drm_atomic_helper_commit_modeset_enables+0x194/0x26c [drm_kms_helper]
msm_atomic_commit_tail+0x204/0x804 [msm]
commit_tail+0xa4/0x18c [drm_kms_helper]
drm_atomic_helper_commit+0x19c/0x1b0 [drm_kms_helper]
drm_atomic_commit+0xa4/0x104 [drm]
drm_client_modeset_commit_atomic+0x22c/0x298 [drm]
drm_client_modeset_commit_locked+0x60/0x1c0 [drm]
drm_client_modeset_commit+0x30/0x58 [drm]
__drm_fb_helper_restore_fbdev_mode_unlocked+0xbc/0xfc [drm_kms_helper]
drm_fb_helper_set_par+0x30/0x4c [drm_kms_helper]
fbcon_init+0x224/0x49c
visual_init+0xb0/0x108
do_bind_con_driver.isra.0+0x19c/0x38c
do_take_over_console+0x140/0x1ec
do_fbcon_takeover+0x6c/0xe4
fbcon_fb_registered+0x180/0x1f0
register_framebuffer+0x19c/0x228
__drm_fb_helper_initial_config_and_unlock+0x2e8/0x4e8 [drm_kms_helper]
drm_fb_helper_initial_config+0x3c/0x4c [drm_kms_helper]
msm_fbdev_client_hotplug+0x84/0xcc [msm]
drm_client_register+0x5c/0xa0 [drm]
msm_fbdev_setup+0x94/0x148 [msm]
msm_drm_bind+0x3d0/0x42c [msm]
try_to_bring_up_aggregate_device+0x1ec/0x2f4
__component_add+0xa8/0x194
component_add+0x14/0x20
dp_display_probe+0x278/0x41c [msm]

> [1] https://patchwork.freedesktop.org/patch/30/
> 
> Fixes: bfcc3d8f94f4 ("drm/msm/dp: support setting the DP subconnector type")
> Reviewed-by: Abhinav Kumar 
> Signed-off-by: Dmitry Baryshkov 

Reviewed-by: Johan Hovold 
Tested-by: Johan Hovold 

Johan


Re: [PATCH] drm/msm/dpu: Add missing safe_lut_tbl in sc8280xp catalog

2023-10-31 Thread Johan Hovold
On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote:
> During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
> typically take 1-2ms to complete. As expected this results in poor
> performance, something that has been mitigated by proposing running the
> iommu in non-strict mode (boot with iommu.strict=0).
> 
> This turns out to be related to the SAFE logic, and programming the QOS
> SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
> TLB sync time to below 10us, which means significant less time spent
> with interrupts disabled and a significant boost in throughput.

I ran some tests with a gigabit ethernet adapter to get an idea of how
this performs in comparison to using lazy iommu mode ("non-strict"):

6.6 6.6-lazy6.6-dpu 6.6-dpu-lazy
iperf3 recv 114 941 941 941 MBit/s
iperf3 send 124 891 703 940 MBit/s

scp recv14.6110 110 111 MB/s
scp send12.598.991.5110 MB/s

This patch in itself indeed improves things quite a bit, but there is
still some performance that can be gained by using lazy iommu mode.

Notably, lazy mode with this patch applied appears to saturate the link
in both directions.

Tested-by: Johan Hovold 

Johan


Re: [PATCH v1] drm/msm/dp: do not reinitialize phy unless retry during link training

2023-10-13 Thread Johan Hovold
On Tue, Oct 03, 2023 at 11:10:59AM +0200, Johan Hovold wrote:
> On Tue, Aug 08, 2023 at 03:19:50PM -0700, Kuogee Hsieh wrote:
> > DP PHY re-initialization done using dp_ctrl_reinitialize_mainlink() will
> > cause PLL unlocked initially and then PLL gets locked at the end of
> > initialization. PLL_UNLOCKED interrupt will fire during this time if the
> > interrupt mask is enabled.
> > However currently DP driver link training implementation incorrectly
> > re-initializes PHY unconditionally during link training as the PHY was
> > already configured in dp_ctrl_enable_mainlink_clocks().
> > 
> > Fix this by re-initializing the PHY only if the previous link training
> > failed.
> > 
> > [drm:dp_aux_isr] *ERROR* Unexpected DP AUX IRQ 0x0100 when not busy
> > 
> > Fixes: c943b4948b58 ("drm/msm/dp: add displayPort driver support")
> > Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/30
> > Signed-off-by: Kuogee Hsieh 
> 
> This fixes the above warning and avoids the unnecessary PHY power-off
> and power-on during boot of the ThinkPad X13s:
> 
> Reviewed-by: Johan Hovold 
> Tested-by: Johan Hovold 
> 
> I guess this one should go to stable as well:
> 
> Cc: sta...@vger.kernel.org# 5.10
> 
> Is anyone planning on getting this fixed in 6.6-rc? I noticed that this
> one still hasn't shown up linux-next.

For the record, this one showed up in a PR from Rob and has now been
forwarded to Linus.

No stable tag included, but I guess we can ping the stable team unless
AUTOSEL picks this up.

Johan


Re: [PATCH v1] drm/msm/dp: do not reinitialize phy unless retry during link training

2023-10-03 Thread Johan Hovold
On Tue, Aug 08, 2023 at 03:19:50PM -0700, Kuogee Hsieh wrote:
> DP PHY re-initialization done using dp_ctrl_reinitialize_mainlink() will
> cause PLL unlocked initially and then PLL gets locked at the end of
> initialization. PLL_UNLOCKED interrupt will fire during this time if the
> interrupt mask is enabled.
> However currently DP driver link training implementation incorrectly
> re-initializes PHY unconditionally during link training as the PHY was
> already configured in dp_ctrl_enable_mainlink_clocks().
> 
> Fix this by re-initializing the PHY only if the previous link training
> failed.
> 
> [drm:dp_aux_isr] *ERROR* Unexpected DP AUX IRQ 0x0100 when not busy
> 
> Fixes: c943b4948b58 ("drm/msm/dp: add displayPort driver support")
> Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/30
> Signed-off-by: Kuogee Hsieh 

This fixes the above warning and avoids the unnecessary PHY power-off
and power-on during boot of the ThinkPad X13s:

Reviewed-by: Johan Hovold 
Tested-by: Johan Hovold 

I guess this one should go to stable as well:

Cc: sta...@vger.kernel.org  # 5.10

Is anyone planning on getting this fixed in 6.6-rc? I noticed that this
one still hasn't shown up linux-next.

> ---
>  drivers/gpu/drm/msm/dp/dp_ctrl.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/dp/dp_ctrl.c 
> b/drivers/gpu/drm/msm/dp/dp_ctrl.c
> index a7a5c7e..77a8d93 100644
> --- a/drivers/gpu/drm/msm/dp/dp_ctrl.c
> +++ b/drivers/gpu/drm/msm/dp/dp_ctrl.c
> @@ -1774,13 +1774,6 @@ int dp_ctrl_on_link(struct dp_ctrl *dp_ctrl)
>   return rc;
>  
>   while (--link_train_max_retries) {
> - rc = dp_ctrl_reinitialize_mainlink(ctrl);
> - if (rc) {
> - DRM_ERROR("Failed to reinitialize mainlink. rc=%d\n",
> - rc);
> - break;
> - }
> -
>   training_step = DP_TRAINING_NONE;
>   rc = dp_ctrl_setup_main_link(ctrl, _step);
>   if (rc == 0) {
> @@ -1832,6 +1825,12 @@ int dp_ctrl_on_link(struct dp_ctrl *dp_ctrl)
>   /* stop link training before start re training  */
>   dp_ctrl_clear_training_pattern(ctrl);
>   }
> +
> + rc = dp_ctrl_reinitialize_mainlink(ctrl);
> + if (rc) {
> + DRM_ERROR("Failed to reinitialize mainlink. rc=%d\n", 
> rc);
> + break;
> + }
>   }
>  
>   if (ctrl->link->sink_request & DP_TEST_LINK_PHY_TEST_PATTERN)

Johan


Re: [PATCH] drm/msm/dp: Drop aux devices together with DP controller

2023-06-19 Thread Johan Hovold
On Mon, Jun 12, 2023 at 03:01:06PM -0700, Bjorn Andersson wrote:
> Using devres to depopulate the aux bus made sure that upon a probe
> deferral the EDP panel device would be destroyed and recreated upon next
> attempt.
> 
> But the struct device which the devres is tied to is the DPUs
> (drm_dev->dev), which may be happen after the DP controller is torn
> down.

There appears to be some words missing in this sentence.
 
> Indications of this can be seen in the commonly seen EDID-hexdump full
> of zeros in the log,

This could happen also when the aux bus lifetime was tied to DP
controller and is mostly benign as dp_aux_deinit() set the "initted"
flag to false.

> or the occasional/rare KASAN fault where the
> panel's attempt to read the EDID information causes a use after free on
> DP resources.

But this is clearly a bug as there's a small window where the aux bus
struct holding the above flag may also have been released...

> It's tempting to move the devres to the DP controller's struct device,
> but the resources used by the device(s) on the aux bus are explicitly
> torn down in the error path. The KASAN-reported use-after-free also
> remains, as the DP aux "module" explicitly frees its devres-allocated
> memory in this code path.

Right, and this would also not work as the aux bus could remain
populated for the next bind attempt which would then fail (as described
in the commit message of the offending commit).

> As such, explicitly depopulate the aux bus in the error path, and in the
> component unbind path, to avoid these issues.

Sounds good.

> Fixes: 2b57f726611e ("drm/msm/dp: fix aux-bus EP lifetime")

This one should also have a stable tag:

Cc: sta...@vger.kernel.org  # 5.19

> Signed-off-by: Bjorn Andersson 
> ---
>  drivers/gpu/drm/msm/dp/dp_display.c | 14 +++---
>  1 file changed, 3 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/dp/dp_display.c 
> b/drivers/gpu/drm/msm/dp/dp_display.c
> index 3d8fa2e73583..bbb0550a022b 100644
> --- a/drivers/gpu/drm/msm/dp/dp_display.c
> +++ b/drivers/gpu/drm/msm/dp/dp_display.c
> @@ -322,6 +322,8 @@ static void dp_display_unbind(struct device *dev, struct 
> device *master,
>  
>   kthread_stop(dp->ev_tsk);
>  
> + of_dp_aux_depopulate_bus(dp->aux);

This may now be called without first having populated the bus, but looks
like that still works.

> +
>   dp_power_client_deinit(dp->power);
>   dp_unregister_audio_driver(dev, dp->audio);
>   dp_aux_unregister(dp->aux);

I know this one was merged while I was out-of-office last week, but for
the record:

Reviewed-by: Johan Hovold 
Tested-by: Johan Hovold 

Johan


Re: Adreno devfreq lockdep splat with 6.3-rc2

2023-06-09 Thread Johan Hovold
On Thu, Jun 08, 2023 at 02:17:45PM -0700, Rob Clark wrote:
> On Thu, Jun 8, 2023 at 7:12 AM Johan Hovold  wrote:

> > Have you had a chance to look at this regression yet? It prevents us
> > from using lockdep on the X13s as it is disabled as soon as we start
> > the GPU.
> 
> Hmm, curious what is different between x13s and sc7180/sc7280 things?

It seems like lockdep needs to hit the tear down path in order to
detect the circular lock dependency. Perhaps you don't hit that on your
sc7180/sc7280? 

It is due to the fact that the panel is looked up way too late so that
bind fails unless the panel driver is already loaded when the msm drm
driver probes.

Manually loading the panel driver before msm makes the splat go away.

> Or did lockdep recently get more clever (or more annotation)?

I think this is indeed a new problem related to some of the devfreq work
you did in 6.3-rc1 (e.g. fadcc3ab1302 ("drm/msm/gpu: Bypass PM QoS
constraint for idle clamp")).

> I did spend some time a while back trying to bring some sense to
> devfreq/pm-qos/icc locking:
> https://patchwork.freedesktop.org/series/115028/
> 
> but haven't had time to revisit that for a while

That's the series I link to below, but IIRC it did not look directly
applicable to the splat I see on X13s (e.g. does not involve
fs_reclaim).

> > On Wed, Mar 15, 2023 at 10:19:21AM +0100, Johan Hovold wrote:
> > >
> > > Since 6.3-rc2 (or possibly -rc1), I'm now seeing the below
> > > devfreq-related lockdep splat.
> > >
> > > I noticed that you posted a fix for something similar here:
> > >
> > >   
> > > https://lore.kernel.org/r/20230312204150.1353517-9-robdcl...@gmail.com
> > >
> > > but that particular patch makes no difference.
> > >
> > > From skimming the calltraces below and qos/devfreq related changes in
> > > 6.3-rc1 it seems like this could be related to:
> > >
> > >   fadcc3ab1302 ("drm/msm/gpu: Bypass PM QoS constraint for idle 
> > > clamp")

Johan


Re: Adreno devfreq lockdep splat with 6.3-rc2

2023-06-08 Thread Johan Hovold
Hi Rob,

Have you had a chance to look at this regression yet? It prevents us
from using lockdep on the X13s as it is disabled as soon as we start
the GPU.

On Wed, Mar 15, 2023 at 10:19:21AM +0100, Johan Hovold wrote:
> 
> Since 6.3-rc2 (or possibly -rc1), I'm now seeing the below
> devfreq-related lockdep splat.
> 
> I noticed that you posted a fix for something similar here:
> 
>   https://lore.kernel.org/r/20230312204150.1353517-9-robdcl...@gmail.com
> 
> but that particular patch makes no difference.
> 
> From skimming the calltraces below and qos/devfreq related changes in
> 6.3-rc1 it seems like this could be related to:
> 
>   fadcc3ab1302 ("drm/msm/gpu: Bypass PM QoS constraint for idle clamp")

Below is an updated splat from 6.4-rc5.

Johan

[ 2941.931507] ==
[ 2941.931509] WARNING: possible circular locking dependency detected
[ 2941.931513] 6.4.0-rc5 #64 Not tainted
[ 2941.931516] --
[ 2941.931518] ring0/359 is trying to acquire lock:
[ 2941.931520] 63310e35c078 (>lock){+.+.}-{3:3}, at: 
qos_min_notifier_call+0x28/0x88
[ 2941.931541] 
   but task is already holding lock:
[ 2941.931543] 63310e3cace8 (&(c->notifiers)->rwsem){}-{3:3}, at: 
blocking_notifier_call_chain+0x30/0x70
[ 2941.931553] 
   which lock already depends on the new lock.

[ 2941.931555] 
   the existing dependency chain (in reverse order) is:
[ 2941.931556] 
   -> #4 (&(c->notifiers)->rwsem){}-{3:3}:
[ 2941.931562]down_write+0x50/0x198
[ 2941.931567]blocking_notifier_chain_register+0x30/0x8c
[ 2941.931570]freq_qos_add_notifier+0x68/0x7c
[ 2941.931574]dev_pm_qos_add_notifier+0xa0/0xf8
[ 2941.931579]devfreq_add_device.part.0+0x360/0x5a4
[ 2941.931583]devm_devfreq_add_device+0x74/0xe0
[ 2941.931587]msm_devfreq_init+0xa0/0x154 [msm]
[ 2941.931624]msm_gpu_init+0x2fc/0x588 [msm]
[ 2941.931649]adreno_gpu_init+0x160/0x2d0 [msm]
[ 2941.931675]a6xx_gpu_init+0x2c0/0x74c [msm]
[ 2941.931699]adreno_bind+0x180/0x290 [msm]
[ 2941.931723]component_bind_all+0x124/0x288
[ 2941.931728]msm_drm_bind+0x1d8/0x6cc [msm]
[ 2941.931752]try_to_bring_up_aggregate_device+0x1ec/0x2f4
[ 2941.931755]__component_add+0xa8/0x194
[ 2941.931758]component_add+0x14/0x20
[ 2941.931761]dp_display_probe+0x2b4/0x474 [msm]
[ 2941.931785]platform_probe+0x68/0xd8
[ 2941.931789]really_probe+0x184/0x3c8
[ 2941.931792]__driver_probe_device+0x7c/0x16c
[ 2941.931794]driver_probe_device+0x3c/0x110
[ 2941.931797]__device_attach_driver+0xbc/0x158
[ 2941.931800]bus_for_each_drv+0x84/0xe0
[ 2941.931802]__device_attach+0xa8/0x1d4
[ 2941.931805]device_initial_probe+0x14/0x20
[ 2941.931807]bus_probe_device+0xb0/0xb4
[ 2941.931810]deferred_probe_work_func+0xa0/0xf4
[ 2941.931812]process_one_work+0x288/0x5bc
[ 2941.931816]worker_thread+0x74/0x450
[ 2941.931818]kthread+0x124/0x128
[ 2941.931822]ret_from_fork+0x10/0x20
[ 2941.931826] 
   -> #3 (dev_pm_qos_mtx){+.+.}-{3:3}:
[ 2941.931831]__mutex_lock+0xa0/0x840
[ 2941.931833]mutex_lock_nested+0x24/0x30
[ 2941.931836]dev_pm_qos_remove_notifier+0x34/0x140
[ 2941.931838]genpd_remove_device+0x3c/0x174
[ 2941.931841]genpd_dev_pm_detach+0x78/0x1b4
[ 2941.931844]dev_pm_domain_detach+0x24/0x34
[ 2941.931846]a6xx_gmu_remove+0x34/0xc4 [msm]
[ 2941.931869]a6xx_destroy+0xd0/0x160 [msm]
[ 2941.931892]adreno_unbind+0x40/0x64 [msm]
[ 2941.931916]component_unbind+0x38/0x6c
[ 2941.931919]component_unbind_all+0xc8/0xd4
[ 2941.931921]msm_drm_uninit.isra.0+0x150/0x1c4 [msm]
[ 2941.931945]msm_drm_bind+0x310/0x6cc [msm]
[ 2941.931967]try_to_bring_up_aggregate_device+0x1ec/0x2f4
[ 2941.931970]__component_add+0xa8/0x194
[ 2941.931973]component_add+0x14/0x20
[ 2941.931976]dp_display_probe+0x2b4/0x474 [msm]
[ 2941.932000]platform_probe+0x68/0xd8
[ 2941.932003]really_probe+0x184/0x3c8
[ 2941.932005]__driver_probe_device+0x7c/0x16c
[ 2941.932008]driver_probe_device+0x3c/0x110
[ 2941.932011]__device_attach_driver+0xbc/0x158
[ 2941.932014]bus_for_each_drv+0x84/0xe0
[ 2941.932016]__device_attach+0xa8/0x1d4
[ 2941.932018]device_initial_probe+0x14/0x20
[ 2941.932021]bus_probe_device+0xb0/0xb4
[ 2941.932023]deferred_probe_work_func+0xa0/0xf4
[ 2941.932026]process_one_work+0x288/0x5bc
[ 2941.932028]worker_thread+0x74/0x450
[ 2941.932031]kthread+0x124/0x128
[ 2941.932035]ret_from_fork+0x10/0x20
[ 2941.932037] 
  

Re: [PATCH] Revert "drm/msm/dp: set self refresh aware based on PSR support"

2023-06-05 Thread Johan Hovold
On Mon, Jun 05, 2023 at 01:05:36PM +0300, Dmitry Baryshkov wrote:
> On Mon, 5 Jun 2023 at 13:02, Johan Hovold  wrote:

> > Virtual terminals are still broken with 6.4-rc5 on the Lenovo ThinkPad
> > X13s two weeks after I reported this, and there has been no indication
> > of any progress in the other related thread:
> >
> > https://lore.kernel.org/lkml/zhyphnwodbxb-...@hovoldconsulting.com
> >
> > Seems like it is time to merge this revert to get this sorted.
> >
> > Rob, Abhinav, Dmitry, can either of you merge this one and get it into
> > 6.4-rc6?
> 
> Rob sent the pull request few hours ago, see
> https://lore.kernel.org/dri-devel/caf6aeguhujkfjra6ys36uyh0kur4hd16u1emqjo8toz3ifv...@mail.gmail.com/

Ok, so you guys went with the module parameter hack. Whatever. As long
as the regression is finally fixed.

Next time, some visibility into your process would be appreciated to
avoid unnecessary work.

Johan


Re: [PATCH] Revert "drm/msm/dp: set self refresh aware based on PSR support"

2023-06-05 Thread Johan Hovold
[ +CC: Thorsten and regzbot so they can help with tracking this
regression ]

#regzbot introduced: v6.3..v6.4-rc1

On Tue, May 23, 2023 at 05:16:46PM +0200, Johan Hovold wrote:
> This reverts commit 1844e680d56bb0c4e0489138f2b7ba2dc1c988e3.
> 
> PSR support clearly is not ready for mainline and specifically breaks
> virtual terminals which are no longer updated when PSR is enabled (e.g.
> no keyboard input is echoed, no cursor blink).
> 
> Disable PSR support for now by reverting commit 1844e680d56b
> ("drm/msm/dp: set self refresh aware based on PSR support").
> 
> Cc: Vinod Polimera 
> Cc: Dmitry Baryshkov 
> Signed-off-by: Johan Hovold 
> ---
> 
> Bjorn reported that PSR support broke virtual terminals two months ago, 
> but this is still broken in 6.4-rc3:
> 
>   https://lore.kernel.org/lkml/20230326162723.3lo6pnsfdwzsvbhj@ripper/
> 
> despite the following series that claimed to address this:
> 
>   
> https://lore.kernel.org/lkml/1680271114-1534-1-git-send-email-quic_vpoli...@quicinc.com
> 
> Let's revert until this has been fixed properly.

Virtual terminals are still broken with 6.4-rc5 on the Lenovo ThinkPad
X13s two weeks after I reported this, and there has been no indication
of any progress in the other related thread:

https://lore.kernel.org/lkml/zhyphnwodbxb-...@hovoldconsulting.com

Seems like it is time to merge this revert to get this sorted.

Rob, Abhinav, Dmitry, can either of you merge this one and get it into
6.4-rc6?

Johan


Re: [PATCH] drm/msm/a6xx: fix uninitialised lock in init error path

2023-06-01 Thread Johan Hovold
On Wed, May 31, 2023 at 07:22:49AM -0700, Doug Anderson wrote:
> Hi,
> 
> On Wed, May 31, 2023 at 1:00 AM Johan Hovold  wrote:
> >
> > A recent commit started taking the GMU lock in the GPU destroy path,
> > which on GPU initialisation failure is called before the GMU and its
> > lock have been initialised.
> >
> > Make sure that the GMU has been initialised before taking the lock in
> > a6xx_destroy() and drop the now redundant check from a6xx_gmu_remove().
> >
> > Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer")
> > Cc: sta...@vger.kernel.org  # 6.3
> > Cc: Douglas Anderson 
> > Signed-off-by: Johan Hovold 
> > ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 3 ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 9 ++---
> >  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> I think Dmitry already posted a patch 1.5 months ago to fix this.
> 
> https://lore.kernel.org/r/20230410165908.3094626-1-dmitry.barysh...@linaro.org

Bah, I checked if Bjorn had hit this with his recent A690 v3 series and
posted a fix, but did not look further than that.

> Can you confirm that works for you?

That looks like it would work too, but I think I prefer my version which
keeps the initialisation of the GMU struct in a6xx_gmu_init().

Dmitry or Rob, could you see to that either version gets merged soon so
that we don't end up with even more people having to debug and fix the
same issue?

Johan


[PATCH] drm/msm/a6xx: fix uninitialised lock in init error path

2023-05-31 Thread Johan Hovold
A recent commit started taking the GMU lock in the GPU destroy path,
which on GPU initialisation failure is called before the GMU and its
lock have been initialised.

Make sure that the GMU has been initialised before taking the lock in
a6xx_destroy() and drop the now redundant check from a6xx_gmu_remove().

Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer")
Cc: sta...@vger.kernel.org  # 6.3
Cc: Douglas Anderson 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 3 ---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 9 ++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c 
b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index e16b4b3f8535..105ccf17041f 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -1472,9 +1472,6 @@ void a6xx_gmu_remove(struct a6xx_gpu *a6xx_gpu)
struct a6xx_gmu *gmu = _gpu->gmu;
struct platform_device *pdev = to_platform_device(gmu->dev);
 
-   if (!gmu->initialized)
-   return;
-
pm_runtime_force_suspend(gmu->dev);
 
/*
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 9fb214f150dd..ee47b95a0205 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1684,6 +1684,7 @@ static void a6xx_destroy(struct msm_gpu *gpu)
 {
struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
+   struct a6xx_gmu *gmu = _gpu->gmu;
 
if (a6xx_gpu->sqe_bo) {
msm_gem_unpin_iova(a6xx_gpu->sqe_bo, gpu->aspace);
@@ -1697,9 +1698,11 @@ static void a6xx_destroy(struct msm_gpu *gpu)
 
a6xx_llc_slices_destroy(a6xx_gpu);
 
-   mutex_lock(_gpu->gmu.lock);
-   a6xx_gmu_remove(a6xx_gpu);
-   mutex_unlock(_gpu->gmu.lock);
+   if (gmu->initialized) {
+   mutex_lock(>lock);
+   a6xx_gmu_remove(a6xx_gpu);
+   mutex_unlock(>lock);
+   }
 
adreno_gpu_cleanup(adreno_gpu);
 
-- 
2.39.3



Re: [PATCH v3 0/3] drm/msm/adreno: GPU support on SC8280XP

2023-05-31 Thread Johan Hovold
On Tue, May 30, 2023 at 08:09:42PM -0700, Bjorn Andersson wrote:
> This series introduces support for A690 in the DRM/MSM driver and
> enables it for the two SC8280XP laptops.
> 
> Bjorn Andersson (3):
>   drm/msm/adreno: Add Adreno A690 support
>   arm64: dts: qcom: sc8280xp: Add GPU related nodes
>   arm64: dts: qcom: sc8280xp: Enable GPU related nodes

Seems to work well (after applying the dependency mentioned in the dtsi
patch):

Tested-by: Johan Hovold 

Johan


Re: [PATCH v3 2/3] arm64: dts: qcom: sc8280xp: Add GPU related nodes

2023-05-31 Thread Johan Hovold
On Tue, May 30, 2023 at 08:09:44PM -0700, Bjorn Andersson wrote:
> From: Bjorn Andersson 
> 
> Add Adreno SMMU, GPU clock controller, GMU and GPU nodes for the
> SC8280XP.
> 
> Tested-by: Steev Klimaszewski 
> Signed-off-by: Bjorn Andersson 
> Signed-off-by: Bjorn Andersson 
> ---
> 
> Changes since v2:
> - Added missing opp level (both gpu and gmu)
> - Corrected opp-level for highest gpu opp
> - Added dma-coherent to gpu smmu
> 
> Note that in order for the GPU driver to probe, the last change
> requires:
> https://lore.kernel.org/linux-arm-msm/20230410185226.3240336-1-dmitry.barysh...@linaro.org/

That's a pretty well-hidden notice about a critical dependency. I just
spent the morning debugging why this series broke the probe of the GPU
and only saw this when I was going to report my findings.

Please consider putting information like this in the cover letter in the
future.

> Changes since v1:
> - Dropped gmu_pdc_seq region from , as it shouldn't have been used.
> - Added missing compatible to _smmu.
> - Dropped aoss_qmp clock in  and _smmu.

Changelogs are also preferably placed in the cover letter so that you
don't have to read through N patches to determine what changed from one
revision of a series to the next.

Johan


Re: [PATCH] drm/msm/dp: add module parameter for PSR

2023-05-30 Thread Johan Hovold
On Wed, May 24, 2023 at 10:13:33AM -0700, Doug Anderson wrote:
> On Wed, May 24, 2023 at 1:06 AM Dmitry Baryshkov
>  wrote:

> > Originally this issue was reported by Doug, and at [1] he reported that
> > an issue is fixed for him. So, for me it looks like we have hardware
> > where VT works and hardware where it doesn't.
> 
> As I understand it, the problem was originally reported by Bjorn over
> IRC. When he reported the problem on VT2, Stephen Boyd confirmed that
> he could see the same problem on our hardware when using VT2. I
> confirmed I could reproduce and also tested the fix. I don't remember
> if Bjorn ever tested the fix. I think many of us assumed that it was
> the same issue so a fix for one person would also fix the other.

Yeah, that sounds reasonable even if there apparently are some
differences between mainline and what you run in ChromeOS here.

> > Doug, can you please confirm whether you can reproduce the PSR+VT issue
> > on 6.4-rc (without extra patches) or if the issue is fixed for you?
> >
> > [1]
> > https://lore.kernel.org/dri-devel/CAD=FV=vshmqptsqfwjviezeerms-veotmfozejasuc9zsmj...@mail.gmail.com/
> 
> Ugh. Unfortunately, it's not easy for me to test this particular
> feature directly on upstream Linux. :( I typically run with a ChromeOS
> root filesystem, which works pretty well with mainline. One place
> where it doesn't work with mainline is VT2. The VT2 feature for
> Chromebooks is implemented with "frecon" and takes advantage of a
> downstream patch that we've carried in the ChromeOS kernel trees for
> years allowing handoff of who the DRM "master" is.
> 
> For testing the fix previously, I confirmed that I could reproduce the
> problem on our downstream kernel (which had the PSR patches picked)
> and that the fixes worked for me in that context.
> 
> Ah, but it shouldn't be too hard to pick that one patch. Let's see if
> that works. I'll pick ("CHROMIUM: drm: Add drm_master_relax debugfs
> file (non-root set/drop master ioctls)"). Indeed, it works!
> 
> OK, so with the top of Linus's tree (v6.4-rc3-17-g9d646009f65d) plus
> the CHROMIUM patch to enable my VT2, I can confirm that I don't see
> the issue. I can switch to VT2 and typing works fine. I will say that
> on herobrine the backlight is all messed up, but I assume that's an
> unrelated issue.

Interesting, as VTs are still broken in with rc4 on the X13s (and
sc8280xp CRD).

Does anyone have any ideas of why things break on mainline with these
machines? Any patches or tests I can try?

Dmitry, do you have an X13s now that you can reproduce this on?

Johan


Re: [PATCH] drm/msm/dp: add module parameter for PSR

2023-05-24 Thread Johan Hovold
On Wed, May 24, 2023 at 11:06:03AM +0300, Dmitry Baryshkov wrote:
> On 24/05/2023 09:59, Johan Hovold wrote:

> > Regressions happen and sometimes there are corner cases that are harder
> > to find, but this is a breakage of a fundamental feature that was
> > reported before the code was even merged into mainline.
> > 
> >> We should have ideally gone with the modparam with the feature patches
> >> itself knowing that it gets enabled for all sinks if PSR is supported.
> > 
> > Modparams are things of the past should not be used to enable broken
> > features so that some vendor can tick of their internal lists of
> > features that have been "mainlined".
> 
> We have had a history of using modparam with i915 and IIRC amdgpu / 
> radeon drivers to allow users to easily check whether new feature works 
> for their hardware. My current understanding is that PSR+VT works for on 
> some laptops and doesn't on some other laptops, which makes it a valid case.

But here it does not seem to be the hardware that's the issue, but
rather that the implementation is incorrect or incomplete.

Johan


Re: [PATCH] drm/msm/dp: add module parameter for PSR

2023-05-24 Thread Johan Hovold
On Tue, May 23, 2023 at 12:23:04PM -0700, Abhinav Kumar wrote:
> On 5/23/2023 8:24 AM, Johan Hovold wrote:
> > On Fri, May 12, 2023 at 09:13:04PM +0300, Dmitry Baryshkov wrote:
> >> On 28/04/2023 02:28, Abhinav Kumar wrote:
> >>> On sc7280 where eDP is the primary display, PSR is causing
> >>> IGT breakage even for basic test cases like kms_atomic and
> >>> kms_atomic_transition. Most often the issue starts with below
> >>> stack so providing that as reference
> >>>
> >>> Call trace:

> >>> ---[ end trace  ]---
> >>> [drm-dp] dp_ctrl_push_idle: PUSH_IDLE pattern timedout
> >>>
> >>> Other basic use-cases still seem to work fine hence add a
> >>> a module parameter to allow toggling psr enable/disable till
> >>> PSR related issues are hashed out with IGT.
> >>
> >> For the reference: Bjorn reported that he has issues with VT on a
> >> PSR-enabled laptops. This patch fixes the issue for him
> > 
> > Module parameters are almost never warranted, and it is definitely not
> > the right way to handle a broken implementation.
> > 
> > I've just sent a revert that unconditionally disables PSR support until
> > the implementation has been fixed:
> > 
> > 
> > https://lore.kernel.org/lkml/20230523151646.28366-1-johan+lin...@kernel.org/
> 
> I dont completely agree with this. Even the virtual terminal case was 
> reported to be fixed by one user but not the other. So it was probably 
> something missed out either in validation or reproduction steps of the 
> user who reported it to be fixed OR the user who reported it not fixed. 
> That needs to be investigated now.

Yes, there may still be some time left to fix it, but it's pretty damn
annoying to find that an issue reported two months ago still is not
fixed at 6.4-rc3. (I even waited to make the switch to 6.4 so that I
would not have to spend time on this.)

I didn't see any mail from Bjorn saying that the series that claimed to
fix the VT issue actually did fix the VT issue. There's only the comment
above from Dmitry suggesting that disabling this feature is the only way
to get a working terminal back.

Regressions happen and sometimes there are corner cases that are harder
to find, but this is a breakage of a fundamental feature that was
reported before the code was even merged into mainline.

> We should have ideally gone with the modparam with the feature patches 
> itself knowing that it gets enabled for all sinks if PSR is supported.

Modparams are things of the past should not be used to enable broken
features so that some vendor can tick of their internal lists of
features that have been "mainlined".

You can carry that single patch out-of-tree to enable this if you need
it for some particular use case where you don't care about VTs.

But hopefully you can just get this sorted quickly. If not, the revert I
posted is the way to go rather than adding random module parameters.

Johan


Re: [PATCH] drm/msm/dp: add module parameter for PSR

2023-05-23 Thread Johan Hovold
On Fri, May 12, 2023 at 09:13:04PM +0300, Dmitry Baryshkov wrote:
> On 28/04/2023 02:28, Abhinav Kumar wrote:
> > On sc7280 where eDP is the primary display, PSR is causing
> > IGT breakage even for basic test cases like kms_atomic and
> > kms_atomic_transition. Most often the issue starts with below
> > stack so providing that as reference
> > 
> > Call trace:
> >   dpu_encoder_assign_crtc+0x64/0x6c
> >   dpu_crtc_enable+0x188/0x204
> >   drm_atomic_helper_commit_modeset_enables+0xc0/0x274
> >   msm_atomic_commit_tail+0x1a8/0x68c
> >   commit_tail+0xb0/0x160
> >   drm_atomic_helper_commit+0x11c/0x124
> >   drm_atomic_commit+0xb0/0xdc
> >   drm_atomic_connector_commit_dpms+0xf4/0x110
> >   drm_mode_obj_set_property_ioctl+0x16c/0x3b0
> >   drm_connector_property_set_ioctl+0x4c/0x74
> >   drm_ioctl_kernel+0xec/0x15c
> >   drm_ioctl+0x264/0x408
> >   __arm64_sys_ioctl+0x9c/0xd4
> >   invoke_syscall+0x4c/0x110
> >   el0_svc_common+0x94/0xfc
> >   do_el0_svc+0x3c/0xb0
> >   el0_svc+0x2c/0x7c
> >   el0t_64_sync_handler+0x48/0x114
> >   el0t_64_sync+0x190/0x194
> > ---[ end trace  ]---
> > [drm-dp] dp_ctrl_push_idle: PUSH_IDLE pattern timedout
> > 
> > Other basic use-cases still seem to work fine hence add a
> > a module parameter to allow toggling psr enable/disable till
> > PSR related issues are hashed out with IGT.
> 
> For the reference: Bjorn reported that he has issues with VT on a 
> PSR-enabled laptops. This patch fixes the issue for him

Module parameters are almost never warranted, and it is definitely not
the right way to handle a broken implementation.

I've just sent a revert that unconditionally disables PSR support until
the implementation has been fixed:


https://lore.kernel.org/lkml/20230523151646.28366-1-johan+lin...@kernel.org/

Johan


[PATCH] Revert "drm/msm/dp: set self refresh aware based on PSR support"

2023-05-23 Thread Johan Hovold
This reverts commit 1844e680d56bb0c4e0489138f2b7ba2dc1c988e3.

PSR support clearly is not ready for mainline and specifically breaks
virtual terminals which are no longer updated when PSR is enabled (e.g.
no keyboard input is echoed, no cursor blink).

Disable PSR support for now by reverting commit 1844e680d56b
("drm/msm/dp: set self refresh aware based on PSR support").

Cc: Vinod Polimera 
Cc: Dmitry Baryshkov 
Signed-off-by: Johan Hovold 
---

Bjorn reported that PSR support broke virtual terminals two months ago, 
but this is still broken in 6.4-rc3:

https://lore.kernel.org/lkml/20230326162723.3lo6pnsfdwzsvbhj@ripper/

despite the following series that claimed to address this:


https://lore.kernel.org/lkml/1680271114-1534-1-git-send-email-quic_vpoli...@quicinc.com

Let's revert until this has been fixed properly.

Johan


 drivers/gpu/drm/msm/dp/dp_drm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/dp/dp_drm.c b/drivers/gpu/drm/msm/dp/dp_drm.c
index 785d76639497..029e08c5bb06 100644
--- a/drivers/gpu/drm/msm/dp/dp_drm.c
+++ b/drivers/gpu/drm/msm/dp/dp_drm.c
@@ -117,8 +117,6 @@ static int edp_bridge_atomic_check(struct drm_bridge 
*drm_bridge,
if (WARN_ON(!conn_state))
return -ENODEV;
 
-   conn_state->self_refresh_aware = dp->psr_supported;
-
if (!conn_state->crtc || !crtc_state)
return 0;
 
-- 
2.39.3



Re: [PATCH v2] drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error

2023-03-31 Thread Johan Hovold
On Fri, Mar 31, 2023 at 01:15:16AM +0200, Konrad Dybcio wrote:
> The adreno_load_gpu() path is guarded by an error check on
> adreno_load_fw(). This function is responsible for loading
> Qualcomm-only-signed binaries (e.g. SQE and GMU FW for A6XX), but it
> does not take the vendor-signed ZAP blob into account.
> 
> By embedding the SQE (and GMU, if necessary) firmware into the
> initrd/kernel, we can trigger and unfortunate path that would not bail
> out early and proceed with gpu->hw_init(). That will fail, as the ZAP
> loader path will not find the firmware and return back to
> adreno_load_gpu().
> 
> This error path involves pm_runtime_put_sync() which then calls idle()
> instead of suspend(). This is suboptimal, as it means that we're not
> going through the clean shutdown sequence. With at least A619_holi, this
> makes the GPU not wake up until it goes through at least one more
> start-fail-stop cycle. The pm_runtime_put_sync that appears in the error
> path actually does not guarantee that because of the earlier enabling of
> runtime autosuspend.
> 
> Fix that by using pm_runtime_put_sync_suspend to force a clean shutdown.
> 
> Test cases:
> 1. All firmware baked into kernel
> 2. error loading ZAP fw in initrd -> load from rootfs at DE start
> 
> Both succeed on A619_holi (SM6375) and A630 (SDM845).
> 
> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")

As this one is marked for stable, you also need:

Cc: sta...@vger.kernel.org  # 6.0

> Signed-off-by: Konrad Dybcio 

Reviewed-by: Johan Hovold 

> ---
> v1 -> v2:
> - Improve the commit message and the reasoning within
> 
>  drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c 
> b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index f61896629be6..59f3302e8167 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
>   return gpu;
>  
>  err_put_rpm:
> - pm_runtime_put_sync(>dev);
> + pm_runtime_put_sync_suspend(>dev);
>  err_disable_rpm:
>   pm_runtime_disable(>dev);

Johan


Re: [PATCH] drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error

2023-03-30 Thread Johan Hovold
On Wed, Mar 29, 2023 at 10:45:52PM +0300, Dmitry Baryshkov wrote:
> On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio  wrote:
> > On 29.03.2023 16:37, Johan Hovold wrote:
> > > On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
> > >> If we fail to initialize the GPU for whatever reason (say we don't
> > >> embed the GPU firmware files in the initrd), the error path involves
> > >> pm_runtime_put_sync() which then calls idle() instead of suspend().
> > >>
> > >> This is suboptimal, as it means that we're not going through the
> > >> clean shutdown sequence. With at least A619_holi, this makes the GPU
> > >> not wake up until it goes through at least one more start-fail-stop
> > >> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
> > >> shutdown.
> > >
> > > This does not sound right. If pm_runtime_put_sync() fails to suspend the
> > > device when the usage count drops to zero, then you have a bug somewhere
> > > else.
> > I was surprised to see that it was not called as well, but I wasn't able
> > to track it down before..
> 
> Could you please check that it's autosuspend who kicks in? In other
> words, if we disable autosuspend, the pm_runtime_put_sync is enough()?

Yes, that's it. The runtime PM implementation changed at one point and
since you need to disable autosuspend first to actually get synchronous
behaviour. My bad.

> That would probably mean that we lack some kind of reset in the hw_init path.
> 
> On the other hand, I do not know how the device will react to the
> error-in-the-middle state. Modems for example, can enter the state
> where you can not properly turn it off once it starts the boot
> process.
> 
> And if we remember the efforts that Akhil has put into making sure
> that the GPU is properly reset in case of an _error_, it might be
> nearly impossible to shut it down in a proper way.
> 
> Thus said, I think that unless there is an obvious way to restart the
> init process, Korad's pm_runtime_put_sync_suspend() looks like a
> correct fix to me.

I'd prefer to fix this by disabling autosuspend, but as that would
involve also moving the call to enable autosuspend to this function (and
add the missing disable on unbind), Konrad's patch using
pm_runtime_put_sync_suspend() is probably the best option for now. I can
send a patch to move the autosuspend handling on top.

Perhaps you can just amend the commit message to clarify that not all fw
is apparently preloaded and also mention that you need to use
pm_runtime_put_sync_suspend() due to autosuspend being enabled.

Johan


Re: [PATCH] drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error

2023-03-29 Thread Johan Hovold
On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
> If we fail to initialize the GPU for whatever reason (say we don't
> embed the GPU firmware files in the initrd), the error path involves
> pm_runtime_put_sync() which then calls idle() instead of suspend().
> 
> This is suboptimal, as it means that we're not going through the
> clean shutdown sequence. With at least A619_holi, this makes the GPU
> not wake up until it goes through at least one more start-fail-stop
> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
> shutdown.

This does not sound right. If pm_runtime_put_sync() fails to suspend the
device when the usage count drops to zero, then you have a bug somewhere
else.

Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware
before bringing up the hardware") the firmware is loaded before even
hitting these paths so the above description does not sound right in
that respect either (or is missing some details).

> Test cases:
> 1. firmware baked into kernel
> 2. error loading fw in initrd -> load from rootfs at DE start
> 
> Both succeed on A619_holi (SM6375) and A630 (SDM845).
> 
> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")
> Signed-off-by: Konrad Dybcio 
> ---
>  drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c 
> b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index f61896629be6..59f3302e8167 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
>   return gpu;
>  
>  err_put_rpm:
> - pm_runtime_put_sync(>dev);
> + pm_runtime_put_sync_suspend(>dev);
>  err_disable_rpm:
>   pm_runtime_disable(>dev);

Johan


Re: [PATCH 00/10] drm/msm: fix bind error handling

2023-03-22 Thread Johan Hovold
On Tue, Mar 21, 2023 at 05:21:56PM +0200, Dmitry Baryshkov wrote:
> On 21/03/2023 15:02, Johan Hovold wrote:
> > On Mon, Mar 06, 2023 at 11:07:12AM +0100, Johan Hovold wrote:
> >> I had reasons to look closer at the MSM DRM driver error handling and
> >> realised that it had suffered from a fair amount of bit rot over the
> >> years.
> >>
> >> Unfortunately, I started fixing this in my 6.2 branch and failed to
> >> notice two partial and, as it turned out, broken attempts to address
> >> this that are now in 6.3-rc1.
> >>
> >> Instead of trying to salvage this incrementally, I'm reverting the two
> >> broken commits so that clean and backportable fixes can be added in
> >> their place.
> >>
> >> Included are also two related cleanups.
> > 
> > Any further comments to these patches (except for 9/10, which should be
> > dropped)?
> > 
> > As the patches being reverted here were first added in 6.3-rc1 there is
> > still time to get this into 6.3-rc (e.g. before AUTOSEL starts trying to
> > backport them).
> 
> I will take a look at the patches. Additional question, as you have been 
> looking into this area. We have plenty of code which is only called 
> under the `if (kms)` condition. Could you hopefully move these parts to 
> separate functions, so that the error handling is also simpler? If not, 
> I'll put this to my todo list, but it might take some time before I have 
> time for that.

There's definitely room for cleaning up the bind/unbind paths further,
but for this series I focus on correctness while maintaining symmetry
(e.g. if an allocation was done under if (kms), then the release should
be done under the same).

I don't think I will have time to look at this further for a few weeks
either, but I'll add it to my list of future work as well and I'll check
in with you before actually working on it.

Johan


Re: [PATCH 05/10] drm/msm: fix drm device leak on bind errors

2023-03-22 Thread Johan Hovold
On Tue, Mar 21, 2023 at 04:54:51PM +0200, Dmitry Baryshkov wrote:
> On 06/03/2023 12:07, Johan Hovold wrote:
> > Make sure to free the DRM device also in case of early errors during
> > bind().
> > 
> > Fixes: 2027e5b3413d ("drm/msm: Initialize MDSS irq domain at probe time")
> > Cc: sta...@vger.kernel.org  # 5.17
> > Cc: Dmitry Baryshkov 
> > Signed-off-by: Johan Hovold 
> 
> Can we migrate to devm_drm_dev_alloc instead() ? Will it make code 
> simpler and/or easier to handle?

I'm just fixing the bugs here. Cleanups/rework like that can be done on
top but should not be backported as it risks introducing new issues.

Johan


Re: [PATCH] drm/meson: fix missing component unbind on bind errors

2023-03-21 Thread Johan Hovold
On Tue, Mar 21, 2023 at 02:27:08PM +0100, Neil Armstrong wrote:

> Anyway for this patch, sorry for the delay, but it's looks fine:
> 
> Acked-by: Neil Armstrong 

Thanks for reviewing!

Johan


Re: [PATCH] drm/meson: fix missing component unbind on bind errors

2023-03-21 Thread Johan Hovold
On Thu, Mar 09, 2023 at 10:41:18PM +0100, Martin Blumenstingl wrote:

> On Mon, Mar 6, 2023 at 11:35 AM Johan Hovold  wrote:
> [...]
> > @@ -325,23 +325,23 @@ static int meson_drv_bind_master(struct device *dev, 
> > bool has_components)
> >
> > ret = meson_encoder_hdmi_init(priv);

> I'm wondering if component_bind_all() can be moved further down.
> Right now it's between meson_encoder_cvbs_init() and
> meson_encoder_hdmi_init(). So it seems that encoders don't rely on
> component registration.

Perhaps it can, but that would be a separate change (unless there is
something inherently wrong with the current initialisation order).
 
> Unfortunately I am also not familiar with this and I'm hoping that
> Neil can comment on this.

Any comments on this one, Neil?

Johan


Re: [PATCH 00/10] drm/msm: fix bind error handling

2023-03-21 Thread Johan Hovold
On Mon, Mar 06, 2023 at 11:07:12AM +0100, Johan Hovold wrote:
> I had reasons to look closer at the MSM DRM driver error handling and
> realised that it had suffered from a fair amount of bit rot over the
> years.
> 
> Unfortunately, I started fixing this in my 6.2 branch and failed to
> notice two partial and, as it turned out, broken attempts to address
> this that are now in 6.3-rc1.
> 
> Instead of trying to salvage this incrementally, I'm reverting the two
> broken commits so that clean and backportable fixes can be added in
> their place.
> 
> Included are also two related cleanups.

Any further comments to these patches (except for 9/10, which should be
dropped)?

As the patches being reverted here were first added in 6.3-rc1 there is
still time to get this into 6.3-rc (e.g. before AUTOSEL starts trying to
backport them).

Johan

> Johan Hovold (10):
>   Revert "drm/msm: Add missing check and destroy for
> alloc_ordered_workqueue"
>   Revert "drm/msm: Fix failure paths in msm_drm_init()"
>   drm/msm: fix NULL-deref on snapshot tear down
>   drm/msm: fix NULL-deref on irq uninstall
>   drm/msm: fix drm device leak on bind errors
>   drm/msm: fix vram leak on bind errors
>   drm/msm: fix missing wq allocation error handling
>   drm/msm: fix workqueue leak on bind errors
>   drm/msm: use drmm_mode_config_init()
>   drm/msm: move include directive
> 
>  drivers/gpu/drm/msm/disp/msm_disp_snapshot.c |  3 -
>  drivers/gpu/drm/msm/msm_drv.c| 67 +---
>  2 files changed, 44 insertions(+), 26 deletions(-)


Re: [PATCH v2 0/4] drm/msm/adreno: fix runtime PM imbalance at unbind

2023-03-21 Thread Johan Hovold
On Fri, Mar 03, 2023 at 05:48:03PM +0100, Johan Hovold wrote:
> As reported by Bjorn, we can end up with an unbalanced runtime PM
> disable count if unbind() is called before the DRM device is opened
> (e.g. if component bind fails due to the panel driver not having been
> loaded yet).
> 
> As runtime PM must currently stay disabled until the firmware has been
> loaded, fix this by making the runtime PM disable call at unbind()
> conditional.
> 
> The rest of the series fixes further imbalances in the load_gpu() error
> paths and removes a bogus pm_runtime_set_active() call. Included is also
> a related indentation cleanup.

I noticed that Rob picked up the first patch below from v1 of this
series. Any comments to the remaining three?

Johan

> Changes in v2
>  - fix the runtime PM imbalance in the gpu load error paths (new)
> 
>  - drop the patch removing the pm_runtime_disable() from
>adreno_gpu_cleanup() as this function can currently still be called
>with runtime PM enabled if suspending the scheduler in
>    adreno_system_suspend() at unbind fails
> 
> 
> Johan Hovold (4):
>   drm/msm/adreno: fix runtime PM imbalance at unbind
>   drm/msm/adreno: fix runtime PM imbalance at gpu load
>   drm/msm/adreno: drop bogus pm_runtime_set_active()
>   drm/msm/adreno: clean up component ops indentation
> 
>  drivers/gpu/drm/msm/adreno/adreno_device.c | 26 +-
>  1 file changed, 16 insertions(+), 10 deletions(-)


Adreno devfreq lockdep splat with 6.3-rc2

2023-03-15 Thread Johan Hovold
Hi Rob,

Since 6.3-rc2 (or possibly -rc1), I'm now seeing the below
devfreq-related lockdep splat.

I noticed that you posted a fix for something similar here:

https://lore.kernel.org/r/20230312204150.1353517-9-robdcl...@gmail.com

but that particular patch makes no difference.

>From skimming the calltraces below and qos/devfreq related changes in
6.3-rc1 it seems like this could be related to:

fadcc3ab1302 ("drm/msm/gpu: Bypass PM QoS constraint for idle clamp")

Johan


[   35.389822] ==
[   35.389824] WARNING: possible circular locking dependency detected
[   35.389826] 6.3.0-rc2 #208 Not tainted
[   35.389828] --
[   35.389829] ring0/348 is trying to acquire lock:
[   35.389830] 43424cfa2078 (>lock){+.+.}-{3:3}, at: 
qos_min_notifier_call+0x28/0x88
[   35.389845] 
   but task is already holding lock:
[   35.389846] 4342432b78e8 (&(c->notifiers)->rwsem){}-{3:3}, at: 
blocking_notifier_call_chain+0x34/0xa0
[   35.389855] 
   which lock already depends on the new lock.

[   35.389856] 
   the existing dependency chain (in reverse order) is:
[   35.389857] 
   -> #4 (&(c->notifiers)->rwsem){}-{3:3}:
[   35.389861]lock_acquire+0x68/0x84
[   35.389865]down_write+0x58/0xfc
[   35.389869]blocking_notifier_chain_register+0x30/0x8c
[   35.389872]freq_qos_add_notifier+0x68/0x7c
[   35.389876]dev_pm_qos_add_notifier+0xe8/0x114
[   35.389881]devfreq_add_device.part.0+0x360/0x5a4
[   35.389884]devm_devfreq_add_device+0x74/0xe0
[   35.389886]msm_devfreq_init+0xa0/0x154 [msm]
[   35.389915]msm_gpu_init+0x320/0x5b0 [msm]
[   35.389933]adreno_gpu_init+0x164/0x2d8 [msm]
[   35.389951]a6xx_gpu_init+0x270/0x608 [msm]
[   35.389968]adreno_bind+0x184/0x284 [msm]
[   35.389983]component_bind_all+0x124/0x288
[   35.389989]msm_drm_bind+0x1d8/0x6a8 [msm]
[   35.390004]try_to_bring_up_aggregate_device+0x1ec/0x2f4
[   35.390007]__component_add+0xa8/0x194
[   35.390010]component_add+0x14/0x20
[   35.390012]dp_display_probe+0x2b4/0x474 [msm]
[   35.390029]platform_probe+0x68/0xd8
[   35.390031]really_probe+0x184/0x3c8
[   35.390034]__driver_probe_device+0x7c/0x188
[   35.390036]driver_probe_device+0x3c/0x110
[   35.390039]__device_attach_driver+0xbc/0x158
[   35.390041]bus_for_each_drv+0x84/0xe0
[   35.390044]__device_attach+0xa8/0x1d4
[   35.390046]device_initial_probe+0x14/0x20
[   35.390049]bus_probe_device+0xac/0xb0
[   35.390051]deferred_probe_work_func+0xa0/0xf4
[   35.390053]process_one_work+0x288/0x6c4
[   35.390056]worker_thread+0x74/0x450
[   35.390058]kthread+0x118/0x11c
[   35.390060]ret_from_fork+0x10/0x20
[   35.390063] 
   -> #3 (dev_pm_qos_mtx){+.+.}-{3:3}:
[   35.390066]lock_acquire+0x68/0x84
[   35.390068]__mutex_lock+0x98/0x428
[   35.390072]mutex_lock_nested+0x2c/0x38
[   35.390074]dev_pm_qos_remove_notifier+0x34/0x140
[   35.390077]genpd_remove_device+0x3c/0x174
[   35.390081]genpd_dev_pm_detach+0x78/0x1b4
[   35.390083]dev_pm_domain_detach+0x24/0x34
[   35.390085]a6xx_gmu_remove+0x64/0xd0 [msm]
[   35.390101]a6xx_destroy+0xa8/0x138 [msm]
[   35.390116]adreno_unbind+0x40/0x64 [msm]
[   35.390131]component_unbind+0x38/0x6c
[   35.390134]component_unbind_all+0xc8/0xd4
[   35.390136]msm_drm_uninit.isra.0+0x168/0x1dc [msm]
[   35.390152]msm_drm_bind+0x2f4/0x6a8 [msm]
[   35.390167]try_to_bring_up_aggregate_device+0x1ec/0x2f4
[   35.390170]__component_add+0xa8/0x194
[   35.390172]component_add+0x14/0x20
[   35.390175]dp_display_probe+0x2b4/0x474 [msm]
[   35.390190]platform_probe+0x68/0xd8
[   35.390192]really_probe+0x184/0x3c8
[   35.390194]__driver_probe_device+0x7c/0x188
[   35.390197]driver_probe_device+0x3c/0x110
[   35.390199]__device_attach_driver+0xbc/0x158
[   35.390201]bus_for_each_drv+0x84/0xe0
[   35.390203]__device_attach+0xa8/0x1d4
[   35.390206]device_initial_probe+0x14/0x20
[   35.390208]bus_probe_device+0xac/0xb0
[   35.390210]deferred_probe_work_func+0xa0/0xf4
[   35.390212]process_one_work+0x288/0x6c4
[   35.390214]worker_thread+0x74/0x450
[   35.390216]kthread+0x118/0x11c
[   35.390217]ret_from_fork+0x10/0x20
[   35.390219] 
   -> #2 (>lock){+.+.}-{3:3}:
[   35.390222]lock_acquire+0x68/0x84
[   35.390224]__mutex_lock+0x98/0x428
[   35.390226]mutex_lock_nested+0x2c/0x38
[   35.390229]a6xx_gpu_set_freq+0x30/0x5c [msm]
[   35.390245]

Re: [PATCH v3 5/7] drm/msm/hdmi: stop using drm_bridge_connector_en/disable_hpd()

2023-03-08 Thread Johan Hovold
On Wed, Nov 02, 2022 at 09:07:03PM +0300, Dmitry Baryshkov wrote:
> The functionality of drm_bridge_connector_enable_hpd() and
> drm_bridge_connector_disable_hpd() is provided automatically by the
> drm_kms_poll helpers. Stop calling these functions manually.

I stumbled over this one when investigating a hotplug-related crash in
the MSM DRM driver which this series prevents by moving hotplug
notification enable to drm_kms_helper_poll_enable().

> Signed-off-by: Dmitry Baryshkov 
> ---
>  drivers/gpu/drm/msm/hdmi/hdmi.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/hdmi/hdmi.c b/drivers/gpu/drm/msm/hdmi/hdmi.c
> index 93fe61b86967..a540c45d4fd3 100644
> --- a/drivers/gpu/drm/msm/hdmi/hdmi.c
> +++ b/drivers/gpu/drm/msm/hdmi/hdmi.c
> @@ -348,8 +348,6 @@ int msm_hdmi_modeset_init(struct hdmi *hdmi,
>   goto fail;
>   }
>  
> - drm_bridge_connector_enable_hpd(hdmi->connector);
> -
>   ret = msm_hdmi_hpd_enable(hdmi->bridge);
>   if (ret < 0) {
>   DRM_DEV_ERROR(>pdev->dev, "failed to enable HPD: %d\n", 
> ret);

It looks like you are now enabling hotplug events before the DRM driver
is ready to receive them (i.e. msm_hdmi_hpd_enable() is called before
drm_bridge_connector_enable_hpd()).

Could this not lead to missed events or is the state being setup
correctly somewhere else?

Shouldn't the call to msm_hdmi_hpd_enable() be moved to when HPD is
enabled either way (e.g. by being converted to a hpd_enable callback)?

Johan


Re: [PATCH] drm/msm: Initialize mode_config earlier

2023-03-08 Thread Johan Hovold
On Thu, Mar 02, 2023 at 03:17:04PM -0800, Bjorn Andersson wrote:
> On Wed, Mar 01, 2023 at 02:58:50PM +0100, Johan Hovold wrote:

> > So after debugging this issue a third time, I can conclude that it is
> > still very much present in 6.2.
> > 
> > It appears you looked at the linux-next tree when you concluded that
> > this patch was not needed. In 6.2 the bridge->hpd_cb callback is set
> > before mode_config.funcs is initialised as part of
> > kms->funcs->hw_init(kms).
> > 
> > The hpd DRM changes heading into 6.3 do appear to avoid the NULL-pointer
> > dereference by moving the bridge->hpd_cb initialisation to
> > drm_kms_helper_poll_init() as you mention above.

I can confirm that as expected my reproducer no longer triggers with
6.3-rc1.
 
> > The PMIC GLINK altmode driver still happily forwards notifications
> > regardless of the DRM driver state though, which can lead to missed
> > hotplug events. It seems you need to implement the
> > hpd_enable()/disable() callbacks and either cache or not enable events
> > in fw until the DRM driver is ready.
> > 
> 
> It's not clear to me what the expectation from the DRM framework is on
> this point. We register a drm_bridge which is only capable of signaling
> HPD events (DRM_BRIDGE_OP_HPD), not querying HPD state (DRM_BRIDGE_OP_DETECT).

I think the assumption is that any bridge that can generate hotplug
events also has a way of detecting whether it is connected (i.e.
DRM_BRIDGE_OP_HPD => DRM_BRIDGE_OP_DETECT).

The pmic_glink_altmode driver appears to be the only driver that sets
DRM_BRIDGE_OP_HPD but not DRM_BRIDGE_OP_DETECT.

> Does this imply that any such bridge must ensure that hpd events are
> re-delivered once hpd_enable() has been invoked (we can't invoke it from
> hpd_enable...)?
> 
> Is it reasonable to do this retriggering in the altmode driver? Or is it
> the job of the TCPM (it seems reasonable to not send the PAN_EN message
> until we get hpd_enable()...)?

Are you sure there is no way to query the firmware about the connected
state?

Otherwise, enabling the notification messages when hpd_enable() is
called looks like it should work as the fw currently appears to always
send a disconnected event followed by a connect event if connected.

But that's not going to be enough unless you can also disable events in
fw on hpd_disable() so that the state can again be updated on the next
hpd_enable().

If that's not possible, it seems you need to cache the state in the
driver and hope you get a notification after a suspend cycle if the
state has changed.

But in any case, the DRM documentation is pretty clear on that a bridge
driver should not be calling drm_bridge_hpd_notify() until hpd_enable()
is called (and also not after hpd_disable()) as the pmic_glink_altmode
driver currently do.

hpd_enable

Enable hot plug detection. From now on the bridge shall call
drm_bridge_hpd_notify() each time a change is detected in the
output connection status, until hot plug detection gets disabled
with hpd_disable.

This callback is optional and shall only be implemented by
bridges that support hot-plug notification without polling.
Bridges that implement it shall also implement the hpd_disable
callback and set the DRM_BRIDGE_OP_HPD flag in their
drm_bridge->ops.


https://dri.freedesktop.org/docs/drm/gpu/drm-kms-helpers.html#c.drm_bridge_funcs

Johan


Re: [PATCH 01/10] Revert "drm/msm: Add missing check and destroy for alloc_ordered_workqueue"

2023-03-07 Thread Johan Hovold
On Wed, Mar 08, 2023 at 10:10:24AM +0800, Jiasheng Jiang wrote:
> On Mon, 06 Mar 2023 18:07:13 +0800, Johan Hovold wrote:
> > This reverts commit 643b7d0869cc7f1f7a5ac7ca6bd25d88f54e31d0.
> 
> The commit not only adds the allocation sanity check, but also adds the
> destroy_workqueue to release the allocated priv->wq.
> Therefore, revert the commit will cause memory leak.

No, reverting this commit does not cause any memory leaks (look at at
diff again).

The original patch called msm_drm_uninit() in some early error paths,
but that was just completely broken as that function must not be called
before the subcomponents have been bound and also triggered a bunch of
other NULL-pointer dereferences.

That bit was however removed when the change was merged with a second
branch that also touched these error paths. In the end, the leaked wq is
still there and this commit only added broken error handling for the wq
allocation failing (as it does not free the drm device).

> > A recent patch that tried to fix up the msm_drm_init() paths with
> > respect to the workqueue but only ended up making things worse:
> > 
> > First, the newly added calls to msm_drm_uninit() on early errors would
> > trigger NULL-pointer dereferences, for example, as the kms pointer would
> > not have been initialised. (Note that these paths were also modified by
> > a second broken error handling patch which in effect cancelled out this
> > part when merged.)
> 
> There is a check for the kms pointer to avoid NULL-pointer dereference in
> the msm_drm_uninit().

No, there were further places in msm_drm_uninit() which did not have any
such checks when you submitted your patch. Some of the missing checks
were added by a separate patch, but that would not in itself have been
sufficient as with your patch you'd still end up trying to unbind the
subcomponents before they are bound, which will lead to further crashes.
 
> > Second, the newly added allocation sanity check would still leak the
> > previously allocated drm device.
> 
> The ddev is allocated by drm_dev_alloc which support automatic cleanup.

We don't have automatic garbage collection in the kernel. You still need
to release the reference to the device for it to be freed.

Johan


Re: [PATCH 09/10] drm/msm: use drmm_mode_config_init()

2023-03-06 Thread Johan Hovold
On Mon, Mar 06, 2023 at 02:38:37PM +0200, Dmitry Baryshkov wrote:
> On 06/03/2023 12:07, Johan Hovold wrote:
> > Switch to using drmm_mode_config_init() so that the mode config is
> > released when the last reference to the DRM device is dropped rather
> > than unconditionally at unbind() (which may be too soon).
> 
> This also means that drm_bridge_detach() might be called at some point 
> after unbind(), which might be too late.

Indeed.

Please disregard this patch. It's not needed to fix the bind error paths
anyway.

Johan


[PATCH] drm/meson: fix missing component unbind on bind errors

2023-03-06 Thread Johan Hovold
Make sure to unbind all subcomponents when binding the aggregate device
fails.

Fixes: a41e82e6c457 ("drm/meson: Add support for components")
Cc: sta...@vger.kernel.org  # 4.12
Cc: Neil Armstrong 
Signed-off-by: Johan Hovold 
---

Note that this one has only been compile tested.

Johan


 drivers/gpu/drm/meson/meson_drv.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/meson/meson_drv.c 
b/drivers/gpu/drm/meson/meson_drv.c
index 79bfe3938d3c..7caf937c3c90 100644
--- a/drivers/gpu/drm/meson/meson_drv.c
+++ b/drivers/gpu/drm/meson/meson_drv.c
@@ -325,23 +325,23 @@ static int meson_drv_bind_master(struct device *dev, bool 
has_components)
 
ret = meson_encoder_hdmi_init(priv);
if (ret)
-   goto exit_afbcd;
+   goto unbind_all;
 
ret = meson_plane_create(priv);
if (ret)
-   goto exit_afbcd;
+   goto unbind_all;
 
ret = meson_overlay_create(priv);
if (ret)
-   goto exit_afbcd;
+   goto unbind_all;
 
ret = meson_crtc_create(priv);
if (ret)
-   goto exit_afbcd;
+   goto unbind_all;
 
ret = request_irq(priv->vsync_irq, meson_irq, 0, drm->driver->name, 
drm);
if (ret)
-   goto exit_afbcd;
+   goto unbind_all;
 
drm_mode_config_reset(drm);
 
@@ -359,6 +359,9 @@ static int meson_drv_bind_master(struct device *dev, bool 
has_components)
 
 uninstall_irq:
free_irq(priv->vsync_irq, drm);
+unbind_all:
+   if (has_components)
+   component_unbind_all(drm->dev, drm);
 exit_afbcd:
if (priv->afbcd.ops)
priv->afbcd.ops->exit(priv);
-- 
2.39.2



[PATCH] drm/sun4i: fix missing component unbind on bind errors

2023-03-06 Thread Johan Hovold
Make sure to unbind all subcomponents when binding the aggregate device
fails.

Fixes: 9026e0d122ac ("drm: Add Allwinner A10 Display Engine support")
Cc: sta...@vger.kernel.org  # 4.7
Cc: Maxime Ripard 
Signed-off-by: Johan Hovold 
---

Note that this one has only been compile tested.

Johan


 drivers/gpu/drm/sun4i/sun4i_drv.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/sun4i/sun4i_drv.c 
b/drivers/gpu/drm/sun4i/sun4i_drv.c
index cc94efbbf2d4..d6c741716167 100644
--- a/drivers/gpu/drm/sun4i/sun4i_drv.c
+++ b/drivers/gpu/drm/sun4i/sun4i_drv.c
@@ -95,12 +95,12 @@ static int sun4i_drv_bind(struct device *dev)
/* drm_vblank_init calls kcalloc, which can fail */
ret = drm_vblank_init(drm, drm->mode_config.num_crtc);
if (ret)
-   goto cleanup_mode_config;
+   goto unbind_all;
 
/* Remove early framebuffers (ie. simplefb) */
ret = drm_aperture_remove_framebuffers(false, _drv_driver);
if (ret)
-   goto cleanup_mode_config;
+   goto unbind_all;
 
sun4i_framebuffer_init(drm);
 
@@ -119,6 +119,8 @@ static int sun4i_drv_bind(struct device *dev)
 
 finish_poll:
drm_kms_helper_poll_fini(drm);
+unbind_all:
+   component_unbind_all(dev, NULL);
 cleanup_mode_config:
drm_mode_config_cleanup(drm);
of_reserved_mem_device_release(dev);
-- 
2.39.2



[PATCH 10/10] drm/msm: move include directive

2023-03-06 Thread Johan Hovold
Move the include of of_address.h to the top of the file where it
belongs.

Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index ade17947d1e5..42ae7575622b 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -272,8 +273,6 @@ static int msm_drm_uninit(struct device *dev)
return 0;
 }
 
-#include 
-
 struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
 {
struct msm_gem_address_space *aspace;
-- 
2.39.2



[PATCH 01/10] Revert "drm/msm: Add missing check and destroy for alloc_ordered_workqueue"

2023-03-06 Thread Johan Hovold
This reverts commit 643b7d0869cc7f1f7a5ac7ca6bd25d88f54e31d0.

A recent patch that tried to fix up the msm_drm_init() paths with
respect to the workqueue but only ended up making things worse:

First, the newly added calls to msm_drm_uninit() on early errors would
trigger NULL-pointer dereferences, for example, as the kms pointer would
not have been initialised. (Note that these paths were also modified by
a second broken error handling patch which in effect cancelled out this
part when merged.)

Second, the newly added allocation sanity check would still leak the
previously allocated drm device.

Instead of trying to salvage what was badly broken (and clearly not
tested), let's revert the bad commit so that clean and backportable
fixes can be added in its place.

Fixes: 643b7d0869cc ("drm/msm: Add missing check and destroy for 
alloc_ordered_workqueue")
Cc: Jiasheng Jiang 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index aca48c868c14..b7f5a78eadd4 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -420,8 +420,6 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
priv->dev = ddev;
 
priv->wq = alloc_ordered_workqueue("msm", 0);
-   if (!priv->wq)
-   return -ENOMEM;
 
INIT_LIST_HEAD(>objects);
mutex_init(>obj_lock);
-- 
2.39.2



[PATCH 07/10] drm/msm: fix missing wq allocation error handling

2023-03-06 Thread Johan Hovold
Add the missing sanity check to handle workqueue allocation failures.

Fixes: c8afe684c95c ("drm/msm: basic KMS driver for snapdragon")
Cc: sta...@vger.kernel.org  # 3.12
Cc: Rob Clark 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 41cc6cd690cd..ac3b77dbfacc 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -432,6 +432,10 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
priv->dev = ddev;
 
priv->wq = alloc_ordered_workqueue("msm", 0);
+   if (!priv->wq) {
+   ret = -ENOMEM;
+   goto err_put_dev;
+   }
 
INIT_LIST_HEAD(>objects);
mutex_init(>obj_lock);
-- 
2.39.2



[PATCH 05/10] drm/msm: fix drm device leak on bind errors

2023-03-06 Thread Johan Hovold
Make sure to free the DRM device also in case of early errors during
bind().

Fixes: 2027e5b3413d ("drm/msm: Initialize MDSS irq domain at probe time")
Cc: sta...@vger.kernel.org  # 5.17
Cc: Dmitry Baryshkov 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 2f2bcdb671d2..89634159ad75 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -444,12 +444,12 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
 
ret = msm_init_vram(ddev);
if (ret)
-   return ret;
+   goto err_put_dev;
 
/* Bind all our sub-components: */
ret = component_bind_all(dev, ddev);
if (ret)
-   return ret;
+   goto err_put_dev;
 
dma_set_max_seg_size(dev, UINT_MAX);
 
@@ -544,6 +544,12 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
 
 err_msm_uninit:
msm_drm_uninit(dev);
+
+   return ret;
+
+err_put_dev:
+   drm_dev_put(ddev);
+
return ret;
 }
 
-- 
2.39.2



[PATCH 00/10] drm/msm: fix bind error handling

2023-03-06 Thread Johan Hovold
I had reasons to look closer at the MSM DRM driver error handling and
realised that it had suffered from a fair amount of bit rot over the
years.

Unfortunately, I started fixing this in my 6.2 branch and failed to
notice two partial and, as it turned out, broken attempts to address
this that are now in 6.3-rc1.

Instead of trying to salvage this incrementally, I'm reverting the two
broken commits so that clean and backportable fixes can be added in
their place.

Included are also two related cleanups.

Johan


Johan Hovold (10):
  Revert "drm/msm: Add missing check and destroy for
alloc_ordered_workqueue"
  Revert "drm/msm: Fix failure paths in msm_drm_init()"
  drm/msm: fix NULL-deref on snapshot tear down
  drm/msm: fix NULL-deref on irq uninstall
  drm/msm: fix drm device leak on bind errors
  drm/msm: fix vram leak on bind errors
  drm/msm: fix missing wq allocation error handling
  drm/msm: fix workqueue leak on bind errors
  drm/msm: use drmm_mode_config_init()
  drm/msm: move include directive

 drivers/gpu/drm/msm/disp/msm_disp_snapshot.c |  3 -
 drivers/gpu/drm/msm/msm_drv.c| 67 +---
 2 files changed, 44 insertions(+), 26 deletions(-)

-- 
2.39.2



[PATCH 04/10] drm/msm: fix NULL-deref on irq uninstall

2023-03-06 Thread Johan Hovold
In case of early initialisation errors and on platforms that do not use
the DPU controller, the deinitilisation code can be called with the kms
pointer set to NULL.

Fixes: f026e431cf86 ("drm/msm: Convert to Linux IRQ interfaces")
Cc: sta...@vger.kernel.org  # 5.14
Cc: Thomas Zimmermann 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 17a59d73fe01..2f2bcdb671d2 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -251,9 +251,11 @@ static int msm_drm_uninit(struct device *dev)
drm_bridge_remove(priv->bridges[i]);
priv->num_bridges = 0;
 
-   pm_runtime_get_sync(dev);
-   msm_irq_uninstall(ddev);
-   pm_runtime_put_sync(dev);
+   if (kms) {
+   pm_runtime_get_sync(dev);
+   msm_irq_uninstall(ddev);
+   pm_runtime_put_sync(dev);
+   }
 
if (kms && kms->funcs)
kms->funcs->destroy(kms);
-- 
2.39.2



[PATCH 09/10] drm/msm: use drmm_mode_config_init()

2023-03-06 Thread Johan Hovold
Switch to using drmm_mode_config_init() so that the mode config is
released when the last reference to the DRM device is dropped rather
than unconditionally at unbind() (which may be too soon).

Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 73c597565f99..ade17947d1e5 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -247,8 +247,6 @@ static int msm_drm_uninit(struct device *dev)
if (kms)
msm_disp_snapshot_destroy(ddev);
 
-   drm_mode_config_cleanup(ddev);
-
for (i = 0; i < priv->num_bridges; i++)
drm_bridge_remove(priv->bridges[i]);
priv->num_bridges = 0;
@@ -454,11 +452,13 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
might_lock(>lru.lock);
fs_reclaim_release(GFP_KERNEL);
 
-   drm_mode_config_init(ddev);
+   ret = drmm_mode_config_init(ddev);
+   if (ret)
+   goto err_destroy_wq;
 
ret = msm_init_vram(ddev);
if (ret)
-   goto err_cleanup_mode_config;
+   goto err_destroy_wq;
 
/* Bind all our sub-components: */
ret = component_bind_all(dev, ddev);
@@ -563,8 +563,7 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
 
 err_deinit_vram:
msm_deinit_vram(ddev);
-err_cleanup_mode_config:
-   drm_mode_config_cleanup(ddev);
+err_destroy_wq:
destroy_workqueue(priv->wq);
 err_put_dev:
drm_dev_put(ddev);
-- 
2.39.2



[PATCH 03/10] drm/msm: fix NULL-deref on snapshot tear down

2023-03-06 Thread Johan Hovold
In case of early initialisation errors and on platforms that do not use
the DPU controller, the deinitilisation code can be called with the kms
pointer set to NULL.

Fixes: 98659487b845 ("drm/msm: add support to take dpu snapshot")
Cc: sta...@vger.kernel.org  # 5.14
Cc: Abhinav Kumar 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 9ded384acba4..17a59d73fe01 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -242,7 +242,8 @@ static int msm_drm_uninit(struct device *dev)
msm_fbdev_free(ddev);
 #endif
 
-   msm_disp_snapshot_destroy(ddev);
+   if (kms)
+   msm_disp_snapshot_destroy(ddev);
 
drm_mode_config_cleanup(ddev);
 
-- 
2.39.2



[PATCH 02/10] Revert "drm/msm: Fix failure paths in msm_drm_init()"

2023-03-06 Thread Johan Hovold
This reverts commit 8636500300a01740d92b345c680b036b94555b1b.

A recent commit tried to address a drm device leak in the early
msm_drm_uninit() error paths but ended up making things worse.

Specifically, it moved the drm device reference put in msm_drm_uninit()
to msm_drm_init() which means that the drm would now be leaked on normal
unbind.

For reasons that were never spelled out, it also added kms NULL pointer
checks to a couple of helper functions that had nothing to do with the
paths modified by the patch.

Instead of trying to salvage this incrementally, let's revert the bad
commit so that clean and backportable fixes can be added in its place.

Fixes: 8636500300a0 ("drm/msm: Fix failure paths in msm_drm_init()")
Cc: Akhil P Oommen 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/disp/msm_disp_snapshot.c |  3 ---
 drivers/gpu/drm/msm/msm_drv.c| 11 ---
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c 
b/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c
index b73031cd48e4..e75b97127c0d 100644
--- a/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c
+++ b/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c
@@ -129,9 +129,6 @@ void msm_disp_snapshot_destroy(struct drm_device *drm_dev)
}
 
priv = drm_dev->dev_private;
-   if (!priv->kms)
-   return;
-
kms = priv->kms;
 
if (kms->dump_worker)
diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index b7f5a78eadd4..9ded384acba4 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -150,9 +150,6 @@ static void msm_irq_uninstall(struct drm_device *dev)
struct msm_drm_private *priv = dev->dev_private;
struct msm_kms *kms = priv->kms;
 
-   if (!priv->kms)
-   return;
-
kms->funcs->irq_uninstall(kms);
if (kms->irq_requested)
free_irq(kms->irq, dev);
@@ -270,6 +267,8 @@ static int msm_drm_uninit(struct device *dev)
component_unbind_all(dev, ddev);
 
ddev->dev_private = NULL;
+   drm_dev_put(ddev);
+
destroy_workqueue(priv->wq);
 
return 0;
@@ -442,12 +441,12 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
 
ret = msm_init_vram(ddev);
if (ret)
-   goto err_drm_dev_put;
+   return ret;
 
/* Bind all our sub-components: */
ret = component_bind_all(dev, ddev);
if (ret)
-   goto err_drm_dev_put;
+   return ret;
 
dma_set_max_seg_size(dev, UINT_MAX);
 
@@ -542,8 +541,6 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
 
 err_msm_uninit:
msm_drm_uninit(dev);
-err_drm_dev_put:
-   drm_dev_put(ddev);
return ret;
 }
 
-- 
2.39.2



[PATCH 08/10] drm/msm: fix workqueue leak on bind errors

2023-03-06 Thread Johan Hovold
Make sure to destroy the workqueue also in case of early errors during
bind (e.g. a subcomponent failing to bind).

Since commit c3b790ea07a1 ("drm: Manage drm_mode_config_init with
drmm_") the mode config will be freed when the drm device is released
also when using the legacy interface, but add an explicit cleanup for
consistency and to facilitate backporting.

Fixes: 060530f1ea67 ("drm/msm: use componentised device support")
Cc: sta...@vger.kernel.org  # 3.15
Cc: Rob Clark 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index ac3b77dbfacc..73c597565f99 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -458,7 +458,7 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
 
ret = msm_init_vram(ddev);
if (ret)
-   goto err_put_dev;
+   goto err_cleanup_mode_config;
 
/* Bind all our sub-components: */
ret = component_bind_all(dev, ddev);
@@ -563,6 +563,9 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
 
 err_deinit_vram:
msm_deinit_vram(ddev);
+err_cleanup_mode_config:
+   drm_mode_config_cleanup(ddev);
+   destroy_workqueue(priv->wq);
 err_put_dev:
drm_dev_put(ddev);
 
-- 
2.39.2



[PATCH 06/10] drm/msm: fix vram leak on bind errors

2023-03-06 Thread Johan Hovold
Make sure to release the VRAM buffer also in a case a subcomponent fails
to bind.

Fixes: d863f0c7b536 ("drm/msm: Call msm_init_vram before binding the gpu")
Cc: sta...@vger.kernel.org  # 5.11
Cc: Craig Tatlor 
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/msm_drv.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 89634159ad75..41cc6cd690cd 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -51,6 +51,8 @@
 #define MSM_VERSION_MINOR  10
 #define MSM_VERSION_PATCHLEVEL 0
 
+static void msm_deinit_vram(struct drm_device *ddev);
+
 static const struct drm_mode_config_funcs mode_config_funcs = {
.fb_create = msm_framebuffer_create,
.output_poll_changed = drm_fb_helper_output_poll_changed,
@@ -260,12 +262,7 @@ static int msm_drm_uninit(struct device *dev)
if (kms && kms->funcs)
kms->funcs->destroy(kms);
 
-   if (priv->vram.paddr) {
-   unsigned long attrs = DMA_ATTR_NO_KERNEL_MAPPING;
-   drm_mm_takedown(>vram.mm);
-   dma_free_attrs(dev, priv->vram.size, NULL,
-  priv->vram.paddr, attrs);
-   }
+   msm_deinit_vram(ddev);
 
component_unbind_all(dev, ddev);
 
@@ -403,6 +400,19 @@ static int msm_init_vram(struct drm_device *dev)
return ret;
 }
 
+static void msm_deinit_vram(struct drm_device *ddev)
+{
+   struct msm_drm_private *priv = ddev->dev_private;
+   unsigned long attrs = DMA_ATTR_NO_KERNEL_MAPPING;
+
+   if (!priv->vram.paddr)
+   return;
+
+   drm_mm_takedown(>vram.mm);
+   dma_free_attrs(ddev->dev, priv->vram.size, NULL, priv->vram.paddr,
+   attrs);
+}
+
 static int msm_drm_init(struct device *dev, const struct drm_driver *drv)
 {
struct msm_drm_private *priv = dev_get_drvdata(dev);
@@ -449,7 +459,7 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
/* Bind all our sub-components: */
ret = component_bind_all(dev, ddev);
if (ret)
-   goto err_put_dev;
+   goto err_deinit_vram;
 
dma_set_max_seg_size(dev, UINT_MAX);
 
@@ -547,6 +557,8 @@ static int msm_drm_init(struct device *dev, const struct 
drm_driver *drv)
 
return ret;
 
+err_deinit_vram:
+   msm_deinit_vram(ddev);
 err_put_dev:
drm_dev_put(ddev);
 
-- 
2.39.2



[PATCH v2 3/4] drm/msm/adreno: drop bogus pm_runtime_set_active()

2023-03-03 Thread Johan Hovold
The runtime PM status can only be updated while runtime PM is disabled.

Drop the bogus pm_runtime_set_active() call that was made after enabling
runtime PM and which (incidentally but correctly) left the runtime PM
status set to 'suspended'.

Fixes: 2c087a336676 ("drm/msm/adreno: Load the firmware before bringing up the 
hardware")
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c 
b/drivers/gpu/drm/msm/adreno/adreno_device.c
index f9a0b11c2e43..d9100e3870bc 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -438,9 +438,6 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
 */
pm_runtime_enable(>dev);
 
-   /* Make sure pm runtime is active and reset any previous errors */
-   pm_runtime_set_active(>dev);
-
ret = pm_runtime_get_sync(>dev);
if (ret < 0) {
pm_runtime_put_noidle(>dev);
-- 
2.39.2



[PATCH v2 4/4] drm/msm/adreno: clean up component ops indentation

2023-03-03 Thread Johan Hovold
Clean up the component ops initialisers which were indented one level
too far.

Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c 
b/drivers/gpu/drm/msm/adreno/adreno_device.c
index d9100e3870bc..f2cdc5ad7ce7 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -571,8 +571,8 @@ static void adreno_unbind(struct device *dev, struct device 
*master,
 }
 
 static const struct component_ops a3xx_ops = {
-   .bind   = adreno_bind,
-   .unbind = adreno_unbind,
+   .bind   = adreno_bind,
+   .unbind = adreno_unbind,
 };
 
 static void adreno_device_register_headless(void)
-- 
2.39.2



[PATCH v2 0/4] drm/msm/adreno: fix runtime PM imbalance at unbind

2023-03-03 Thread Johan Hovold
As reported by Bjorn, we can end up with an unbalanced runtime PM
disable count if unbind() is called before the DRM device is opened
(e.g. if component bind fails due to the panel driver not having been
loaded yet).

As runtime PM must currently stay disabled until the firmware has been
loaded, fix this by making the runtime PM disable call at unbind()
conditional.

The rest of the series fixes further imbalances in the load_gpu() error
paths and removes a bogus pm_runtime_set_active() call. Included is also
a related indentation cleanup.

Johan


Changes in v2
 - fix the runtime PM imbalance in the gpu load error paths (new)

 - drop the patch removing the pm_runtime_disable() from
   adreno_gpu_cleanup() as this function can currently still be called
   with runtime PM enabled if suspending the scheduler in
   adreno_system_suspend() at unbind fails


Johan Hovold (4):
  drm/msm/adreno: fix runtime PM imbalance at unbind
  drm/msm/adreno: fix runtime PM imbalance at gpu load
  drm/msm/adreno: drop bogus pm_runtime_set_active()
  drm/msm/adreno: clean up component ops indentation

 drivers/gpu/drm/msm/adreno/adreno_device.c | 26 +-
 1 file changed, 16 insertions(+), 10 deletions(-)

-- 
2.39.2



[PATCH v2 1/4] drm/msm/adreno: fix runtime PM imbalance at unbind

2023-03-03 Thread Johan Hovold
A recent commit moved enabling of runtime PM from adreno_gpu_init() to
adreno_load_gpu() (called on first open()), which means that unbind()
may now be called with runtime PM disabled in case the device was never
opened in between.

Make sure to only forcibly suspend and disable runtime PM at unbind() in
case runtime PM has been enabled to prevent a disable count imbalance.

This specifically avoids leaving runtime PM disabled when the device
is later opened after a successful bind:

msm_dpu ae01000.display-controller: [drm:adreno_load_gpu [msm]] *ERROR* 
Couldn't power up the GPU: -13

Fixes: 4b18299b3365 ("drm/msm/adreno: Defer enabling runpm until hw_init()")
Reported-by: Bjorn Andersson 
Link: 
https://lore.kernel.org/lkml/20230203181245.3523937-1-quic_bjora...@quicinc.com
Cc: sta...@vger.kernel.org  # 6.0
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c 
b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 36f062c7582f..c5c4c93b3689 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -558,7 +558,8 @@ static void adreno_unbind(struct device *dev, struct device 
*master,
struct msm_drm_private *priv = dev_get_drvdata(master);
struct msm_gpu *gpu = dev_to_gpu(dev);
 
-   WARN_ON_ONCE(adreno_system_suspend(dev));
+   if (pm_runtime_enabled(dev))
+   WARN_ON_ONCE(adreno_system_suspend(dev));
gpu->funcs->destroy(gpu);
 
priv->gpu_pdev = NULL;
-- 
2.39.2



[PATCH v2 2/4] drm/msm/adreno: fix runtime PM imbalance at gpu load

2023-03-03 Thread Johan Hovold
A recent commit moved enabling of runtime PM to GPU load time (first
open()) but failed to update the error paths so that runtime PM is
disabled if initialisation of the GPU fails. This would trigger a
warning about the unbalanced disable count on the next open() attempt.

Note that pm_runtime_put_noidle() is sufficient to balance the usage
count when pm_runtime_put_sync() fails (and is chosen over
pm_runtime_resume_and_get() for consistency reasons).

Fixes: 4b18299b3365 ("drm/msm/adreno: Defer enabling runpm until hw_init()")
Cc: sta...@vger.kernel.org  # 6.0
Signed-off-by: Johan Hovold 
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c 
b/drivers/gpu/drm/msm/adreno/adreno_device.c
index c5c4c93b3689..f9a0b11c2e43 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -443,20 +443,21 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
 
ret = pm_runtime_get_sync(>dev);
if (ret < 0) {
-   pm_runtime_put_sync(>dev);
+   pm_runtime_put_noidle(>dev);
DRM_DEV_ERROR(dev->dev, "Couldn't power up the GPU: %d\n", ret);
-   return NULL;
+   goto err_disable_rpm;
}
 
mutex_lock(>lock);
ret = msm_gpu_hw_init(gpu);
mutex_unlock(>lock);
-   pm_runtime_put_autosuspend(>dev);
if (ret) {
DRM_DEV_ERROR(dev->dev, "gpu hw init failed: %d\n", ret);
-   return NULL;
+   goto err_put_rpm;
}
 
+   pm_runtime_put_autosuspend(>dev);
+
 #ifdef CONFIG_DEBUG_FS
if (gpu->funcs->debugfs_init) {
gpu->funcs->debugfs_init(gpu, dev->primary);
@@ -465,6 +466,13 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
 #endif
 
return gpu;
+
+err_put_rpm:
+   pm_runtime_put_sync(>dev);
+err_disable_rpm:
+   pm_runtime_disable(>dev);
+
+   return NULL;
 }
 
 static int find_chipid(struct device *dev, struct adreno_rev *rev)
-- 
2.39.2



  1   2   3   >