Hi Geert,

On 29/10/2025 14:08, Geert Uytterhoeven wrote:
> Hi all,
> 
> While playing with the PowerVR driver on various R-Car SoCs, I ran into
> a crash/race condition on Gray Hawk Single (R-Car V4M).  After adding
> the GPU device node to DTS, the driver fails to probe due to lack of
> suitable firmware, as expected:

Thanks for the detailed report! I'll make time to look into this. Do you
encounter a similar issue on other R-Car platforms, or is this exclusive
to the V4M?

> 
>     powervr fd000000.gpu: Direct firmware load for
> powervr/rogue_36.53.104.796_v1.fw failed with error -2
>     powervr fd000000.gpu: [drm] *ERROR* failed to load firmware
> powervr/rogue_36.53.104.796_v1.fw (err=-2)
> 
> However, after that it crashes:
> 
>     Unable to handle kernel NULL pointer dereference at virtual
> address 0000000000000040
>     Mem abort info:
>     PM: GENPD_STATE_OFF
>       ESR = 0x0000000096000004
>     PM: sd_count 19
>       EC = 0x25: DABT (current EL), IL = 32 bits
>       SET = 0, FnV = 0
>       EA = 0, S1PTW = 0
>       FSC = 0x04: level 0 translation fault
>     Data abort info:
>       ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>       CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>       GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>     [0000000000000040] user address but active_mm is swapper
>     Internal error: Oops: 0000000096000004 [#1]  SMP
>     CPU: 2 UID: 0 PID: 46 Comm: kworker/u16:2 Tainted: G        W
>      6.18.0-rc3-arm64-renesas-04934-g585255656363-dirty #3296 PREEMPT
>     Tainted: [W]=WARN
>     Hardware name: Renesas Gray Hawk Single board based on r8a779h0 (DT)
>     Workqueue: pm pm_runtime_work
>     pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>     pc : default_suspend_ok+0xb4/0x20c
>     lr : default_suspend_ok+0x9c/0x20c
>     sp : ffff800081e23bc0
>     x29: ffff800081e23bd0 x28: ffff800080e2d240 x27: ffff0004402d8bc0
>     x26: ffff0004402e2900 x25: ffff000440b554e4 x24: ffff80008108bb70
>     x23: 0000000000000001 x22: ffff80008108bb88 x21: ffff800080e2c568
>     x20: ffff800080e2d258 x19: ffff000440b55400 x18: 0000000000000006
>     x17: 2030683937376138 x16: 72206e6f20646573 x15: ffff800081e23510
>     x14: 0000000000000000 x13: 3035616234383138 x12: 3030303866666666
>     x11: 0000000000000533 x10: 000000000000005d x9 : 000000000001b64f
>     x8 : 7f7f7f7f7f7f7f7f x7 : 205d373032323131 x6 : 0000000000000000
>     x5 : 0000000000000030 x4 : 0000000000000000 x3 : 0000000000000043
>     x2 : ffff800080e2d258 x1 : ffff80008108bb70 x0 : ffff000440b55400
>     Call trace:
>      default_suspend_ok+0xb4/0x20c (P)
>      genpd_runtime_suspend+0x11c/0x4e0
>      __rpm_callback+0x44/0x1cc
>      rpm_callback+0x6c/0x78
>      rpm_suspend+0x108/0x564
>      pm_runtime_work+0xb8/0xbc
>      process_one_work+0x144/0x280
>      worker_thread+0x2c8/0x3d0
>      kthread+0x128/0x1e0
>      ret_from_fork+0x10/0x20
>     Code: aa1303e0 52800863 b0005661 912dc021 (f9402095)
>     ---[ end trace 0000000000000000 ]---
> 
> This driver uses manual PM Domain handling for multiple PM Domains.  In
> my case, pvr_power_domains_fini() calls dev_pm_domain_detach() (twice),
> which calls dev_pm_put_subsys_data(), and sets dev->power.subsys_data to
> NULL when psd->refcount reaches zero.
> 
> Later/in parallel, default_suspend_ok() calls dev_gpd_data():
> 
>     static inline struct generic_pm_domain_data *dev_gpd_data(struct
> device *dev)
>     {
>             return to_gpd_data(dev->power.subsys_data->domain_data);
>     }
> 
> triggering the NULL pointer dereference.  Depending on timing, it may
> crash earlier or later in genpd_runtime_suspend(), or not crash at all
> (initially, I saw it only with extra debug prints in the genpd subsystem).
> 
> As the driver mixes automatic (devm_*()) and manual cleanup, my first
> guess was that devm_pm_runtime_enable() would keep Runtime PM enabled
> too long after the manual call to pvr_power_domains_fini(), but
> replacing that by manual cleanup:

Regardless of whether it fixes this issue, it feels like we should clean
this up and stick to one cleanup method driver-wide.

> 
>     --- a/drivers/gpu/drm/imagination/pvr_drv.c
>     +++ b/drivers/gpu/drm/imagination/pvr_drv.c
>     @@ -1424,7 +1424,7 @@ pvr_probe(struct platform_device *plat_dev)
>             if (err)
>                     goto err_context_fini;
> 
>     -       devm_pm_runtime_enable(&plat_dev->dev);
>     +       pm_runtime_enable(&plat_dev->dev);
>             pm_runtime_mark_last_busy(&plat_dev->dev);
> 
>             pm_runtime_set_autosuspend_delay(&plat_dev->dev, 50);
>     @@ -1450,6 +1450,9 @@ pvr_probe(struct platform_device *plat_dev)
>      err_watchdog_fini:
>             pvr_watchdog_fini(pvr_dev);
> 
>     +       pm_runtime_dont_use_autosuspend(&plat_dev->dev);
>     +       pm_runtime_disable(&plat_dev->dev);
>     +
>             pvr_queue_device_fini(pvr_dev);
> 
>      err_context_fini:
>     @@ -1475,6 +1478,8 @@ static void pvr_remove(struct
> platform_device *plat_dev)
>             pvr_device_fini(pvr_dev);
>             drm_dev_unplug(drm_dev);
>             pvr_watchdog_fini(pvr_dev);
>     +       pm_runtime_dont_use_autosuspend(&plat_dev->dev);
>     +       pm_runtime_disable(&plat_dev->dev);
>             pvr_queue_device_fini(pvr_dev);
>             pvr_context_device_fini(pvr_dev);
>             pvr_power_domains_fini(pvr_dev);
> 
> did not fix the issue.  Calling pm_runtime_force_suspend() instead of
> pm_runtime_dont_use_autosuspend() also didn't help.
> 
> Do you have a clue?

Not yet, but we'll look into it :)

Cheers,
Matt

> Thanks!
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 

-- 
Matt Coster
E: [email protected]

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature

Reply via email to