Hi all,

While playing with the PowerVR driver on various R-Car SoCs, I ran into
a crash/race condition on Gray Hawk Single (R-Car V4M).  After adding
the GPU device node to DTS, the driver fails to probe due to lack of
suitable firmware, as expected:

    powervr fd000000.gpu: Direct firmware load for
powervr/rogue_36.53.104.796_v1.fw failed with error -2
    powervr fd000000.gpu: [drm] *ERROR* failed to load firmware
powervr/rogue_36.53.104.796_v1.fw (err=-2)

However, after that it crashes:

    Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000040
    Mem abort info:
    PM: GENPD_STATE_OFF
      ESR = 0x0000000096000004
    PM: sd_count 19
      EC = 0x25: DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
      FSC = 0x04: level 0 translation fault
    Data abort info:
      ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
      CM = 0, WnR = 0, TnD = 0, TagAccess = 0
      GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
    [0000000000000040] user address but active_mm is swapper
    Internal error: Oops: 0000000096000004 [#1]  SMP
    CPU: 2 UID: 0 PID: 46 Comm: kworker/u16:2 Tainted: G        W
     6.18.0-rc3-arm64-renesas-04934-g585255656363-dirty #3296 PREEMPT
    Tainted: [W]=WARN
    Hardware name: Renesas Gray Hawk Single board based on r8a779h0 (DT)
    Workqueue: pm pm_runtime_work
    pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    pc : default_suspend_ok+0xb4/0x20c
    lr : default_suspend_ok+0x9c/0x20c
    sp : ffff800081e23bc0
    x29: ffff800081e23bd0 x28: ffff800080e2d240 x27: ffff0004402d8bc0
    x26: ffff0004402e2900 x25: ffff000440b554e4 x24: ffff80008108bb70
    x23: 0000000000000001 x22: ffff80008108bb88 x21: ffff800080e2c568
    x20: ffff800080e2d258 x19: ffff000440b55400 x18: 0000000000000006
    x17: 2030683937376138 x16: 72206e6f20646573 x15: ffff800081e23510
    x14: 0000000000000000 x13: 3035616234383138 x12: 3030303866666666
    x11: 0000000000000533 x10: 000000000000005d x9 : 000000000001b64f
    x8 : 7f7f7f7f7f7f7f7f x7 : 205d373032323131 x6 : 0000000000000000
    x5 : 0000000000000030 x4 : 0000000000000000 x3 : 0000000000000043
    x2 : ffff800080e2d258 x1 : ffff80008108bb70 x0 : ffff000440b55400
    Call trace:
     default_suspend_ok+0xb4/0x20c (P)
     genpd_runtime_suspend+0x11c/0x4e0
     __rpm_callback+0x44/0x1cc
     rpm_callback+0x6c/0x78
     rpm_suspend+0x108/0x564
     pm_runtime_work+0xb8/0xbc
     process_one_work+0x144/0x280
     worker_thread+0x2c8/0x3d0
     kthread+0x128/0x1e0
     ret_from_fork+0x10/0x20
    Code: aa1303e0 52800863 b0005661 912dc021 (f9402095)
    ---[ end trace 0000000000000000 ]---

This driver uses manual PM Domain handling for multiple PM Domains.  In
my case, pvr_power_domains_fini() calls dev_pm_domain_detach() (twice),
which calls dev_pm_put_subsys_data(), and sets dev->power.subsys_data to
NULL when psd->refcount reaches zero.

Later/in parallel, default_suspend_ok() calls dev_gpd_data():

    static inline struct generic_pm_domain_data *dev_gpd_data(struct
device *dev)
    {
            return to_gpd_data(dev->power.subsys_data->domain_data);
    }

triggering the NULL pointer dereference.  Depending on timing, it may
crash earlier or later in genpd_runtime_suspend(), or not crash at all
(initially, I saw it only with extra debug prints in the genpd subsystem).

As the driver mixes automatic (devm_*()) and manual cleanup, my first
guess was that devm_pm_runtime_enable() would keep Runtime PM enabled
too long after the manual call to pvr_power_domains_fini(), but
replacing that by manual cleanup:

    --- a/drivers/gpu/drm/imagination/pvr_drv.c
    +++ b/drivers/gpu/drm/imagination/pvr_drv.c
    @@ -1424,7 +1424,7 @@ pvr_probe(struct platform_device *plat_dev)
            if (err)
                    goto err_context_fini;

    -       devm_pm_runtime_enable(&plat_dev->dev);
    +       pm_runtime_enable(&plat_dev->dev);
            pm_runtime_mark_last_busy(&plat_dev->dev);

            pm_runtime_set_autosuspend_delay(&plat_dev->dev, 50);
    @@ -1450,6 +1450,9 @@ pvr_probe(struct platform_device *plat_dev)
     err_watchdog_fini:
            pvr_watchdog_fini(pvr_dev);

    +       pm_runtime_dont_use_autosuspend(&plat_dev->dev);
    +       pm_runtime_disable(&plat_dev->dev);
    +
            pvr_queue_device_fini(pvr_dev);

     err_context_fini:
    @@ -1475,6 +1478,8 @@ static void pvr_remove(struct
platform_device *plat_dev)
            pvr_device_fini(pvr_dev);
            drm_dev_unplug(drm_dev);
            pvr_watchdog_fini(pvr_dev);
    +       pm_runtime_dont_use_autosuspend(&plat_dev->dev);
    +       pm_runtime_disable(&plat_dev->dev);
            pvr_queue_device_fini(pvr_dev);
            pvr_context_device_fini(pvr_dev);
            pvr_power_domains_fini(pvr_dev);

did not fix the issue.  Calling pm_runtime_force_suspend() instead of
pm_runtime_dont_use_autosuspend() also didn't help.

Do you have a clue?
Thanks!

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

Reply via email to