Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-07 Thread Maxime Ripard
On Mon, May 06, 2024 at 07:50:57PM GMT, Laurent Pinchart wrote:
> On Mon, May 06, 2024 at 10:57:17AM -0400, Sean Anderson wrote:
> > On 5/6/24 03:35, Laurent Pinchart wrote:
> > > On Mon, May 06, 2024 at 09:29:36AM +0200, Maxime Ripard wrote:
> > >> Hi Laurent, Sean,
> > >> 
> > >> On Sat, May 04, 2024 at 03:21:18PM GMT, Laurent Pinchart wrote:
> > >> > On Fri, May 03, 2024 at 05:54:32PM -0400, Sean Anderson wrote:
> > >> > > I have discovered a bug in the displayport driver on drm-misc-next. 
> > >> > > To
> > >> > > trigger it, run
> > >> > > 
> > >> > > echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
> > >> > > 
> > >> > > The system will become unresponsive and (after a bit) splat with a 
> > >> > > hard
> > >> > > LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
> > >> > > zynqmp_dp_bridge_detect.
> > >> > > 
> > >> > > I believe the issue is due the registers being unmapped and the block
> > >> > > put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release.
> > >> > 
> > >> > That is on purpose. Drivers are not allowed to access the device at all
> > >> > after .remove() returns.
> > >> 
> > >> It's not "on purpose" no. Drivers indeed are not allowed to access the
> > >> device after remove, but the kernel shouldn't crash. This is exactly
> > >> why we have drm_dev_enter / drm_dev_exit.
> > > 
> > > I didn't mean the crash was on purpose :-) It's the registers being
> > > unmapped that is, as nothing should touch those registers after
> > > .remove() returns.
> > 
> > OK, so then we need to have some kind of flag in the driver or in the drm
> > subsystem so we know not to access those registers.
> 
> To avoid race conditions, the .remove() function should mark the device
> as removed, wait for all ongoing access from userspace to be complete,
> and then proceed to unmapping registers and doing other cleanups.
> Userspace may still have open file descriptors to the device at that
> point. Any new userspace access should be disallowed (by checking the
> removed flag), with the only userspace-initiated operations that still
> need to run being the release-related operations (unmapping memory,
> closing file descriptors, ...).

And for the record, this is exactly what drm_dev_unplug and
drm_dev_enter/drm_dev_exit does.

Maxime


signature.asc
Description: PGP signature


Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-06 Thread Laurent Pinchart
On Mon, May 06, 2024 at 10:57:17AM -0400, Sean Anderson wrote:
> On 5/6/24 03:35, Laurent Pinchart wrote:
> > On Mon, May 06, 2024 at 09:29:36AM +0200, Maxime Ripard wrote:
> >> Hi Laurent, Sean,
> >> 
> >> On Sat, May 04, 2024 at 03:21:18PM GMT, Laurent Pinchart wrote:
> >> > On Fri, May 03, 2024 at 05:54:32PM -0400, Sean Anderson wrote:
> >> > > I have discovered a bug in the displayport driver on drm-misc-next. To
> >> > > trigger it, run
> >> > > 
> >> > > echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
> >> > > 
> >> > > The system will become unresponsive and (after a bit) splat with a hard
> >> > > LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
> >> > > zynqmp_dp_bridge_detect.
> >> > > 
> >> > > I believe the issue is due the registers being unmapped and the block
> >> > > put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release.
> >> > 
> >> > That is on purpose. Drivers are not allowed to access the device at all
> >> > after .remove() returns.
> >> 
> >> It's not "on purpose" no. Drivers indeed are not allowed to access the
> >> device after remove, but the kernel shouldn't crash. This is exactly
> >> why we have drm_dev_enter / drm_dev_exit.
> > 
> > I didn't mean the crash was on purpose :-) It's the registers being
> > unmapped that is, as nothing should touch those registers after
> > .remove() returns.
> 
> OK, so then we need to have some kind of flag in the driver or in the drm
> subsystem so we know not to access those registers.

To avoid race conditions, the .remove() function should mark the device
as removed, wait for all ongoing access from userspace to be complete,
and then proceed to unmapping registers and doing other cleanups.
Userspace may still have open file descriptors to the device at that
point. Any new userspace access should be disallowed (by checking the
removed flag), with the only userspace-initiated operations that still
need to run being the release-related operations (unmapping memory,
closing file descriptors, ...).

-- 
Regards,

Laurent Pinchart


Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-06 Thread Sean Anderson
On 5/6/24 03:35, Laurent Pinchart wrote:
> On Mon, May 06, 2024 at 09:29:36AM +0200, Maxime Ripard wrote:
>> Hi Laurent, Sean,
>> 
>> On Sat, May 04, 2024 at 03:21:18PM GMT, Laurent Pinchart wrote:
>> > On Fri, May 03, 2024 at 05:54:32PM -0400, Sean Anderson wrote:
>> > > I have discovered a bug in the displayport driver on drm-misc-next. To
>> > > trigger it, run
>> > > 
>> > > echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
>> > > 
>> > > The system will become unresponsive and (after a bit) splat with a hard
>> > > LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
>> > > zynqmp_dp_bridge_detect.
>> > > 
>> > > I believe the issue is due the registers being unmapped and the block
>> > > put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release.
>> > 
>> > That is on purpose. Drivers are not allowed to access the device at all
>> > after .remove() returns.
>> 
>> It's not "on purpose" no. Drivers indeed are not allowed to access the
>> device after remove, but the kernel shouldn't crash. This is exactly
>> why we have drm_dev_enter / drm_dev_exit.
> 
> I didn't mean the crash was on purpose :-) It's the registers being
> unmapped that is, as nothing should touch those registers after
> .remove() returns.

OK, so then we need to have some kind of flag in the driver or in the drm
subsystem so we know not to access those registers.

--Sean



Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-06 Thread Sean Anderson
On 5/6/24 07:16, Tomi Valkeinen wrote:
> Hi,
> 
> On 04/05/2024 00:54, Sean Anderson wrote:
>> Hi,
>>
>> I have discovered a bug in the displayport driver on drm-misc-next. To
>> trigger it, run
>>
>> echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
>>
>> The system will become unresponsive and (after a bit) splat with a hard
>> LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
>> zynqmp_dp_bridge_detect.
>>
>> I believe the issue is due the registers being unmapped and the block
>> put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release. This
>> could be resolved by deferring things until zynqmp_dpsub_release
>> (requiring us to skip devm_*), or by adding a flag to struct zynqmp_dp
>> and checking it before each callback. A subsystem-level implementation
>> might be better for the latter.
>>
>> For a better traceback, try applying the below patch and running the
>> following commands before triggering the lockup:
>>
>> echo 4 > /sys/module/drm/parameters/debug
>> echo 8 > /proc/sys/kernel/printk
> 
> I wasn't able to reproduce. Where does the detect call come in your case? 
> Looking at the code, afaics, it shouldn't happen.

# echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
[234105.917005] Console: switching to colour dummy device 80x25
[234105.962474] zynqmp-dpsub fd4a.display: [drm:drm_client_release] fbdev
[234105.970397] zynqmp-dpsub fd4a.display: [drm:drm_sysfs_connector_remove] 
[CONNECTOR:41:DP-1] removing connector from sysfs
[234105.991669] zynqmp-dpsub fd4a.display: 
[drm:drm_helper_probe_single_connector_modes] [CONNECTOR:41:DP-1]
[234106.001833] [ cut here ]
[234106.006570] WARNING: CPU: 2 PID: 527 at 
drivers/gpu/drm/xlnx/zynqmp_dp.c:1537 zynqmp_dp_bridge_detect 
(drivers/gpu/drm/xlnx/zynqmp_dp.c:1537 (discriminator 1)) 
[234106.016960] Modules linked in:
[234106.021306] CPU: 2 PID: 527 Comm: Xorg Not tainted 6.9.0-rc6+ #34
[234106.027858] Hardware name: xlnx,zynqmp (DT)
[234106.032146] pstate: a005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[234106.039219] pc : zynqmp_dp_bridge_detect 
(drivers/gpu/drm/xlnx/zynqmp_dp.c:1537 (discriminator 1)) 
[234106.044297] lr : drm_bridge_connector_detect 
(drivers/gpu/drm/drm_bridge_connector.c:176) 
[234106.049548] sp : ffc0895bb980
[234106.052968] x29: ffc0895bb980 x28: ff8805fe1000 x27: 
ffc0818ff000
[234106.060251] x26: 0001 x25: 0050 x24: 
1000
[234106.067534] x23: 1000 x22: ff8805fe0030 x21: 
ff8805fe1000
[234106.074816] x20: ffc0895bbae8 x19: ff8805fe1000 x18: 
5550
[234106.082099] x17: 6e6e6f635f656c67 x16: 6e69735f65626f72 x15: 
705f7265706c6568
[234106.089382] x14: 5f6d72643a6d7264 x13:  x12: 

[234106.096664] x11: 02c5 x10: 02c5 x9 : 
000402c5
[234106.103947] x8 : 95d72cee x7 : f1c4be6a x6 : 
ffc082707b78
[234106.111229] x5 :  x4 : ff8801f2 x3 : 

[234106.118512] x2 : ffc080a5a2bc x1 : 0123456789abcdef x0 : 
deadbeefdeadbeef
[234106.125795] Call trace:
[234106.128347] zynqmp_dp_bridge_detect (drivers/gpu/drm/xlnx/zynqmp_dp.c:1537 
(discriminator 1)) 
[234106.133077] drm_bridge_connector_detect 
(drivers/gpu/drm/drm_bridge_connector.c:176) 
[234106.137981] drm_helper_probe_detect 
(drivers/gpu/drm/drm_probe_helper.c:407) 
[234106.142538] drm_helper_probe_single_connector_modes 
(drivers/gpu/drm/drm_probe_helper.c:596) 
[234106.148658] drm_mode_getconnector (drivers/gpu/drm/drm_connector.c:2947) 
[234106.153215] drm_ioctl_kernel (drivers/gpu/drm/drm_ioctl.c:744 
(discriminator 1)) 
[234106.157251] drm_ioctl (drivers/gpu/drm/drm_ioctl.c:841) 
[234106.160767] __arm64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890 
fs/ioctl.c:890) 
[234106.164803] invoke_syscall (arch/arm64/include/asm/current.h:19 
arch/arm64/kernel/syscall.c:53) 
[234106.168665] el0_svc_common.constprop.0 (arch/arm64/kernel/syscall.c:140) 
[234106.173483] do_el0_svc (arch/arm64/kernel/syscall.c:153) 
[234106.176911] el0_svc (arch/arm64/include/asm/irqflags.h:83 
arch/arm64/include/asm/irqflags.h:124 arch/arm64/include/asm/irqflags.h:137 
arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 
arch/arm64/kernel/entry-common.c:713) 
[234106.180166] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731) 
[234106.184637] el0t_64_sync (arch/arm64/kernel/entry.S:598) 
[234106.188413] irq event stamp: 348036
[234106.192006] hardirqs last enabled at (348035): console_unlock 
(arch/arm64/include/asm/irqflags.h:83 arch/arm64/include/asm/irqflags.h:124 
arch/arm64/include/asm/irqflags.h:137 kernel/printk/printk.c:341 
kernel/printk/printk.c:2731 kernel/printk/printk.c:3050) 
[234106.200816] hardirqs last disabled at (348036): el1_dbg 
(arch/arm64/kernel/entry-common.c:371 (discriminator 4) 
arch/arm64/kernel/entry-common.c:471 

Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-06 Thread Tomi Valkeinen

Hi,

On 04/05/2024 00:54, Sean Anderson wrote:

Hi,

I have discovered a bug in the displayport driver on drm-misc-next. To
trigger it, run

echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind

The system will become unresponsive and (after a bit) splat with a hard
LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
zynqmp_dp_bridge_detect.

I believe the issue is due the registers being unmapped and the block
put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release. This
could be resolved by deferring things until zynqmp_dpsub_release
(requiring us to skip devm_*), or by adding a flag to struct zynqmp_dp
and checking it before each callback. A subsystem-level implementation
might be better for the latter.

For a better traceback, try applying the below patch and running the
following commands before triggering the lockup:

echo 4 > /sys/module/drm/parameters/debug
echo 8 > /proc/sys/kernel/printk


I wasn't able to reproduce. Where does the detect call come in your 
case? Looking at the code, afaics, it shouldn't happen.


 Tomi


diff --git a/drivers/gpu/drm/xlnx/zynqmp_dp.c b/drivers/gpu/drm/xlnx/zynqmp_dp.c
index 9df068a413f3..17b477b14ab5 100644
--- a/drivers/gpu/drm/xlnx/zynqmp_dp.c
+++ b/drivers/gpu/drm/xlnx/zynqmp_dp.c
@@ -296,6 +296,7 @@ struct zynqmp_dp_config {
   * @train_set: set of training data
   */
  struct zynqmp_dp {
+   unsigned long long magic;
 struct device *dev;
 struct zynqmp_dpsub *dpsub;
 void __iomem *iomem;
@@ -1533,6 +1534,8 @@ static enum drm_connector_status 
zynqmp_dp_bridge_detect(struct drm_bridge *brid
 u32 state, i;
 int ret;
  
+   WARN_ON(dp->magic != 0x0123456789abcdefULL);

+
 /*
  * This is from heuristic. It takes some delay (ex, 100 ~ 500 msec) to
  * get the HPD signal with some monitors.
@@ -1723,6 +1726,7 @@ int zynqmp_dp_probe(struct zynqmp_dpsub *dpsub)
 if (!dp)
 return -ENOMEM;
  
+   dp->magic = 0x0123456789abcdefULL;

 dp->dev = >dev;
 dp->dpsub = dpsub;
 dp->status = connector_status_disconnected;
@@ -1839,4 +1843,5 @@ void zynqmp_dp_remove(struct zynqmp_dpsub *dpsub)
  
 zynqmp_dp_phy_exit(dp);

 zynqmp_dp_reset(dp, true);
+   dp->magic = 0xdeadbeefdeadbeefULL;
  }




Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-06 Thread Laurent Pinchart
On Mon, May 06, 2024 at 09:29:36AM +0200, Maxime Ripard wrote:
> Hi Laurent, Sean,
> 
> On Sat, May 04, 2024 at 03:21:18PM GMT, Laurent Pinchart wrote:
> > On Fri, May 03, 2024 at 05:54:32PM -0400, Sean Anderson wrote:
> > > I have discovered a bug in the displayport driver on drm-misc-next. To
> > > trigger it, run
> > > 
> > > echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
> > > 
> > > The system will become unresponsive and (after a bit) splat with a hard
> > > LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
> > > zynqmp_dp_bridge_detect.
> > > 
> > > I believe the issue is due the registers being unmapped and the block
> > > put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release.
> > 
> > That is on purpose. Drivers are not allowed to access the device at all
> > after .remove() returns.
> 
> It's not "on purpose" no. Drivers indeed are not allowed to access the
> device after remove, but the kernel shouldn't crash. This is exactly
> why we have drm_dev_enter / drm_dev_exit.

I didn't mean the crash was on purpose :-) It's the registers being
unmapped that is, as nothing should touch those registers after
.remove() returns.

-- 
Regards,

Laurent Pinchart


Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-06 Thread Maxime Ripard
Hi Laurent, Sean,

On Sat, May 04, 2024 at 03:21:18PM GMT, Laurent Pinchart wrote:
> On Fri, May 03, 2024 at 05:54:32PM -0400, Sean Anderson wrote:
> > I have discovered a bug in the displayport driver on drm-misc-next. To
> > trigger it, run
> > 
> > echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
> > 
> > The system will become unresponsive and (after a bit) splat with a hard
> > LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
> > zynqmp_dp_bridge_detect.
> > 
> > I believe the issue is due the registers being unmapped and the block
> > put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release.
> 
> That is on purpose. Drivers are not allowed to access the device at all
> after .remove() returns.

It's not "on purpose" no. Drivers indeed are not allowed to access the
device after remove, but the kernel shouldn't crash. This is exactly
why we have drm_dev_enter / drm_dev_exit.

Maxime


signature.asc
Description: PGP signature


Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-04 Thread Laurent Pinchart
Hi Sean,

On Fri, May 03, 2024 at 05:54:32PM -0400, Sean Anderson wrote:
> Hi,
> 
> I have discovered a bug in the displayport driver on drm-misc-next. To
> trigger it, run
> 
> echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
> 
> The system will become unresponsive and (after a bit) splat with a hard
> LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
> zynqmp_dp_bridge_detect.
> 
> I believe the issue is due the registers being unmapped and the block
> put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release.

That is on purpose. Drivers are not allowed to access the device at all
after .remove() returns.

> This
> could be resolved by deferring things until zynqmp_dpsub_release
> (requiring us to skip devm_*), or by adding a flag to struct zynqmp_dp
> and checking it before each callback. A subsystem-level implementation
> might be better for the latter.
> 
> For a better traceback, try applying the below patch and running the
> following commands before triggering the lockup:
> 
> echo 4 > /sys/module/drm/parameters/debug
> echo 8 > /proc/sys/kernel/printk
> 
> diff --git a/drivers/gpu/drm/xlnx/zynqmp_dp.c 
> b/drivers/gpu/drm/xlnx/zynqmp_dp.c
> index 9df068a413f3..17b477b14ab5 100644
> --- a/drivers/gpu/drm/xlnx/zynqmp_dp.c
> +++ b/drivers/gpu/drm/xlnx/zynqmp_dp.c
> @@ -296,6 +296,7 @@ struct zynqmp_dp_config {
>   * @train_set: set of training data
>   */
>  struct zynqmp_dp {
> +   unsigned long long magic;
> struct device *dev;
> struct zynqmp_dpsub *dpsub;
> void __iomem *iomem;
> @@ -1533,6 +1534,8 @@ static enum drm_connector_status 
> zynqmp_dp_bridge_detect(struct drm_bridge *brid
> u32 state, i;
> int ret;
>  
> +   WARN_ON(dp->magic != 0x0123456789abcdefULL);
> +
> /*
>  * This is from heuristic. It takes some delay (ex, 100 ~ 500 msec) to
>  * get the HPD signal with some monitors.
> @@ -1723,6 +1726,7 @@ int zynqmp_dp_probe(struct zynqmp_dpsub *dpsub)
> if (!dp)
> return -ENOMEM;
>  
> +   dp->magic = 0x0123456789abcdefULL;
> dp->dev = >dev;
> dp->dpsub = dpsub;
> dp->status = connector_status_disconnected;
> @@ -1839,4 +1843,5 @@ void zynqmp_dp_remove(struct zynqmp_dpsub *dpsub)
>  
> zynqmp_dp_phy_exit(dp);
> zynqmp_dp_reset(dp, true);
> +   dp->magic = 0xdeadbeefdeadbeefULL;
>  }

-- 
Regards,

Laurent Pinchart


[BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

2024-05-03 Thread Sean Anderson
Hi,

I have discovered a bug in the displayport driver on drm-misc-next. To
trigger it, run

echo fd4a.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind

The system will become unresponsive and (after a bit) splat with a hard
LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
zynqmp_dp_bridge_detect.

I believe the issue is due the registers being unmapped and the block
put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release. This
could be resolved by deferring things until zynqmp_dpsub_release
(requiring us to skip devm_*), or by adding a flag to struct zynqmp_dp
and checking it before each callback. A subsystem-level implementation
might be better for the latter.

For a better traceback, try applying the below patch and running the
following commands before triggering the lockup:

echo 4 > /sys/module/drm/parameters/debug
echo 8 > /proc/sys/kernel/printk

diff --git a/drivers/gpu/drm/xlnx/zynqmp_dp.c b/drivers/gpu/drm/xlnx/zynqmp_dp.c
index 9df068a413f3..17b477b14ab5 100644
--- a/drivers/gpu/drm/xlnx/zynqmp_dp.c
+++ b/drivers/gpu/drm/xlnx/zynqmp_dp.c
@@ -296,6 +296,7 @@ struct zynqmp_dp_config {
  * @train_set: set of training data
  */
 struct zynqmp_dp {
+   unsigned long long magic;
struct device *dev;
struct zynqmp_dpsub *dpsub;
void __iomem *iomem;
@@ -1533,6 +1534,8 @@ static enum drm_connector_status 
zynqmp_dp_bridge_detect(struct drm_bridge *brid
u32 state, i;
int ret;
 
+   WARN_ON(dp->magic != 0x0123456789abcdefULL);
+
/*
 * This is from heuristic. It takes some delay (ex, 100 ~ 500 msec) to
 * get the HPD signal with some monitors.
@@ -1723,6 +1726,7 @@ int zynqmp_dp_probe(struct zynqmp_dpsub *dpsub)
if (!dp)
return -ENOMEM;
 
+   dp->magic = 0x0123456789abcdefULL;
dp->dev = >dev;
dp->dpsub = dpsub;
dp->status = connector_status_disconnected;
@@ -1839,4 +1843,5 @@ void zynqmp_dp_remove(struct zynqmp_dpsub *dpsub)
 
zynqmp_dp_phy_exit(dp);
zynqmp_dp_reset(dp, true);
+   dp->magic = 0xdeadbeefdeadbeefULL;
 }