Re: [RFC PATCH v4 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout

sashiko-bot Sat, 13 Jun 2026 00:16:03 -0700

Thank you for your contribution! Sashiko AI review found 7 potential issue(s) 
to consider:


Pre-existing issues:
- [Critical] Double free of `rjob->tasks` when `rocket_copy_tasks` fails.
- [High] The shared IRQ handler accesses hardware registers without checking if 
the device is active via pm_runtime_get_if_active().
- [High] The runtime suspend callback powers down the hardware without 
synchronizing the shared IRQ.
- [High] Timeout recovery drops the PM usage count early via 
`pm_runtime_put_noidle()`, stranding the device in an active power state and 
introducing a race condition.
- [High] PM runtime usage counter leaks on error paths during job submission.
- [High] Driver removal cancels pending PM transitions without synchronous 
suspend.
- [High] Missing NULL check on `iommu_group_get()` leads to crash if device 
lacks an IOMMU.
--

commit 54a038a17307496e6d47b23eab270cebaf07629e
Author: Midgy BALON <[email protected]>

accel: rocket: Reset the NPU before detaching the IOMMU on timeout

This commit fixes a timeout issue where detaching the IOMMU before resetting 
the wedged NPU AXI master causes a stall. By asserting the core reset first, 
the AXI master is quiesced and the IOMMU detach completes cleanly.

> diff --git a/drivers/accel/rocket/rocket_job.c 
> b/drivers/accel/rocket/rocket_job.c
> index ac51bff39833f..e25234261536b 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -364,14 +364,20 @@ rocket_reset(struct rocket_core *core, struct 
> drm_sched_job *bad)
>               if (core->in_flight_job)
>                       pm_runtime_put_noidle(core->dev);

[Severity: High]
This is a pre-existing issue, but does using pm_runtime_put_noidle() here
leave the device stranded in an active power state?

Without restarting the autosuspend timer or scheduling an idle callback, the
usage counter is dropped while leaving the device active. 

Could this also create a race condition if an idle check is triggered
concurrently (such as via a sysfs write), causing the device to begin
suspending while the subsequent rocket_core_reset() is toggling reset lines?

>  
> -             iommu_detach_group(NULL, core->iommu_group);
> -
>               core->in_flight_job = NULL;
>       }
>  
> -     /* Proceed with reset now. */
> +     /*
> +      * Reset the NPU hardware before detaching the IOMMU. A timed-out job
> +      * leaves the NPU AXI master wedged; detaching the IOMMU then issues a
> +      * stall request that never drains and times out (warning in the IOMMU
> +      * core). Asserting the core reset first quiesces the master so the
> +      * detach completes cleanly.
> +      */
>       rocket_core_reset(core);
>  
> +     iommu_detach_group(NULL, core->iommu_group);
> +
>       /* NPU has been reset, we can clear the reset pending bit. */
>       atomic_set(&core->reset.pending, 0);
>  

[Severity: High]
This isn't a bug introduced by this patch, but looking at the shared IRQ
handler:

drivers/accel/rocket/rocket_job.c:rocket_job_irq_handler() {
    ...
    u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
    ...
}

Is it possible that this accesses hardware registers without verifying if the
device is active?

If the NPU is suspended and another device triggers the shared interrupt line,
could this read from powered-off hardware and return invalid data
(like 0xffffffff)?

Might this be misinterpreted as a valid interrupt, leading to spurious
threaded IRQ execution and potential bus crashes?

Is it necessary to use pm_runtime_get_if_active() to verify the power state
first?

[Severity: High]
This isn't a bug introduced by this patch, but does the runtime suspend
callback power down hardware without synchronizing the shared IRQ?

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
    ...
    clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
                               rdev->cores[core].clks);
    ...
}

Since the driver uses a shared IRQ (IRQF_SHARED), could the handler execute 
mid-flight on another CPU while the hardware is actively being shut down?

Should this call synchronize_irq() before disabling hardware clocks to prevent
bus errors or system crashes?

[Severity: High]
This is a pre-existing issue, but does rocket_job_run() leak the PM usage
counter on error paths?

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
    ...
    ret = pm_runtime_get_sync(core->dev);
    if (ret < 0)
        return fence;
    ...
    ret = iommu_attach_group(job->domain->domain, core->iommu_group);
    if (ret < 0)
        return fence;
    ...
}

If either pm_runtime_get_sync() or iommu_attach_group() fails, would returning
an unsignaled fence early without balancing the usage counter via
pm_runtime_put_noidle() permanently leak the PM reference?

Might this prevent the device from ever entering runtime suspend, while also
leaving the DRM scheduler stalled on a dead fence until it times out?

[Severity: High]
This is a pre-existing issue, but does driver removal cancel pending PM
transitions without a synchronous suspend?

drivers/accel/rocket/rocket_core.c:rocket_core_fini() {
    ...
    pm_runtime_disable(core->dev);
    ...
}

Because pm_runtime_disable() cancels pending asynchronous transitions (such as
the autosuspend timer), wouldn't the NPU hardware clocks be left running
indefinitely after the driver is removed and the device is unbound?

Is a call to pm_runtime_put_sync() needed before disabling runtime PM?

[Severity: High]
This isn't a bug introduced by this patch, but is there a missing NULL check 
when fetching the IOMMU group?

drivers/accel/rocket/rocket_core.c:rocket_core_init() {
    ...
    core->iommu_group = iommu_group_get(dev);
    ...
}

If the device lacks an IOMMU (for example, if it is disabled by the user or
failed to probe), will iommu_group_get() return NULL?

If so, when rocket_job_run() later executes:

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
    ...
    ret = iommu_attach_group(job->domain->domain, core->iommu_group);
    ...
}

Would this unconditionally dereference the group pointer and cause a kernel
panic on the first job submission?

[Severity: Critical]
This is a pre-existing issue, but does a failure in rocket_copy_tasks() lead 
to a double free?

drivers/accel/rocket/rocket_job.c:rocket_copy_tasks() {
    ...
fail:
    kvfree(rjob->tasks);
    return ret;
}

If this fails (such as from an invalid pointer during copy_from_user()), the
tasks array is freed, but is rjob->tasks left as a dangling pointer?

When the caller jumps to out_cleanup_job and invokes rocket_job_put(rjob),
it reaches:

drivers/accel/rocket/rocket_job.c:rocket_job_cleanup() {
    ...
    kvfree(job->tasks);
    ...
}

Will this result in freeing the dangling pointer a second time?

Can this double free corrupt kernel memory or cause a panic, potentially
being exploitable by an unprivileged user?

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=4

Re: [RFC PATCH v4 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout

Reply via email to