Thank you for your contribution! Sashiko AI review found 9 potential issue(s) 
to consider:

Pre-existing issues:
- [Critical] Double free of rjob->tasks array when copy_from_user() fails.
- [High] The shared IRQ handler unconditionally accesses hardware registers 
without verifying the device's PM active state.
- [High] Missing PM reference drop and unsignaled fence return on error paths 
in `rocket_job_run()`.
- [High] Concurrent hardware reset and IRQ handler execution can cause an 
external abort.
- [High] Unbalanced teardown order in rocket_core_fini() causes UAF and data 
races.
- [High] NULL pointer dereference of job->domain in rocket_job_cleanup() on 
early error paths.
- [High] Missing `synchronize_irq()` in runtime suspend callback.
- [Medium] The TDR reset path permanently leaks power by preventing the device 
from entering runtime suspend.
- [Medium] The hardware reset is performed after dropping the PM runtime 
reference, potentially targeting an unpowered device.
--

commit 323afa3b038052ef6c02ee0548658a65126910a0
Author: Midgy BALON <[email protected]>

accel: rocket: Reset the NPU before detaching the IOMMU on timeout

On a job timeout the NPU AXI master can be left wedged with outstanding
transactions. rocket_reset() detached the IOMMU group before resetting
the hardware, causing the IOMMU core to stall. Assert the core reset
first to quiesce the master so the detach completes cleanly.

> diff --git a/drivers/accel/rocket/rocket_job.c 
> b/drivers/accel/rocket/rocket_job.c
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -364,14 +364,20 @@ rocket_reset(struct rocket_core *core, struct 
> drm_sched_job *bad)
>               if (core->in_flight_job)
>                       pm_runtime_put_noidle(core->dev);

[Severity: Medium]
This is a pre-existing issue, but does this TDR reset path permanently leak
power by preventing the device from entering runtime suspend?

It appears pm_runtime_put_noidle() drops the reference without triggering an
idle power-down. If the timed-out job was the last active job, the usage count
reaches zero but the device remains in the RPM_ACTIVE state indefinitely,
staying powered on until another job is submitted and correctly triggers
autosuspend.

> -             iommu_detach_group(NULL, core->iommu_group);
> -
>               core->in_flight_job = NULL;
>       }
>  
> -     /* Proceed with reset now. */
> +     /*
> +      * Reset the NPU hardware before detaching the IOMMU. A timed-out job
> +      * leaves the NPU AXI master wedged; detaching the IOMMU then issues a
> +      * stall request that never drains and times out (warning in the IOMMU
> +      * core). Asserting the core reset first quiesces the master so the
> +      * detach completes cleanly.
> +      */
>       rocket_core_reset(core);

[Severity: High]
This isn't a bug introduced by this patch, but can a concurrent hardware reset
and IRQ handler execution cause an external abort?

rocket_core_reset() asserts a hardware reset without first disabling device
interrupts. If an interrupt fires just before or during the reset,
rocket_job_irq_handler() will execute and perform MMIO reads and writes to the
bus while the hardware is actively in reset.

[Severity: Medium]
This is also a pre-existing issue, but is it safe to perform the hardware
reset after dropping the PM runtime reference?

Dropping the PM reference via pm_runtime_put_noidle() earlier decrements the
usage count. A concurrently scheduled autosuspend timer could expire precisely
in this window and power down the device before or during the reset operations.

Asserting resets on a suspended device can lead to system bus faults. Should
the PM reference be dropped after the hardware reset is complete?

> +
> +     iommu_detach_group(NULL, core->iommu_group);
> +
>       /* NPU has been reset, we can clear the reset pending bit. */
>       atomic_set(&core->reset.pending, 0);


[Severity: High]
This isn't a bug introduced by this patch, but does the shared IRQ handler
unconditionally access hardware registers without verifying the device's PM
active state?

drivers/accel/rocket/rocket_job.c:rocket_job_irq_handler() {
    ...
    u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
    ...
}

The driver registers rocket_job_irq_handler() with the IRQF_SHARED flag.
Without first calling pm_runtime_get_if_active(), if the NPU is runtime
suspended, this read accesses an unpowered register domain. This typically
returns 0xffffffff (causing spurious handling) or triggers a synchronous
external abort.


[Severity: High]
This is a pre-existing issue, but does rocket_job_run() leak the PM reference
and return an unsignaled fence on error paths?

drivers/accel/rocket/rocket_job.c:rocket_job_run() {
    ...
    ret = pm_runtime_get_sync(core->dev);
    if (ret < 0)
        return fence;
    ...
}

If pm_runtime_get_sync() or iommu_attach_group() fails, the function returns
a newly created, un-signaled fence instead of an error pointer, and leaves
the PM usage counter incremented. Because the hardware submission step is
skipped, the hardware will never signal this fence, causing the DRM scheduler
to hang indefinitely.


[Severity: High]
This is a pre-existing issue, but does the teardown order in rocket_core_fini()
cause a use-after-free and data races?

drivers/accel/rocket/rocket_core.c:rocket_core_fini() {
    ...
    iommu_group_put(core->iommu_group);
    core->iommu_group = NULL;
    rocket_job_fini(core);
    ...
}

Resources are torn down in the wrong order. The iommu_group and attached_domain
members are freed, and pm_runtime_disable() is called before rocket_job_fini().

If a scheduled job is processed by the DRM scheduler during this window,
rocket_job_run() will concurrently access the now-NULL iommu_group without
locking. Should rocket_job_fini() be called first to stop the scheduler and
wait for jobs?


[Severity: Critical]
This isn't a bug introduced by this patch, but is there a double free of the
rjob->tasks array when copy_from_user() fails?

drivers/accel/rocket/rocket_job.c:rocket_copy_tasks() {
    ...
fail:
    kvfree(rjob->tasks);
    return ret;
}

If copy_from_user() fails, it frees rjob->tasks but leaves the dangling pointer.

The function returns an error, prompting rocket_ioctl_submit_job() to jump to
out_cleanup_job and call rocket_job_put(rjob). This invokes
rocket_job_cleanup(), which unconditionally calls kvfree(job->tasks),
resulting in a double-free.


[Severity: High]
This is a pre-existing issue, but does rocket_job_cleanup() dereference a NULL
job->domain on early error paths?

drivers/accel/rocket/rocket_job.c:rocket_job_cleanup() {
    ...
    rocket_iommu_domain_put(job->domain);
    ...
}

The job->domain structure is initially NULL and only populated late in the
function. If an earlier step fails (such as rocket_copy_tasks()), the cleanup
unconditionally calls rocket_iommu_domain_put(job->domain).

This wrapper dereferences the domain struct to access its kref without a
NULL check:

drivers/accel/rocket/rocket_drv.c:rocket_iommu_domain_put() {
    kref_put(&domain->kref, rocket_iommu_domain_destroy);
}


[Severity: High]
This isn't a bug introduced by this patch, but is there a missing
synchronize_irq() in the runtime suspend callback?

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
    ...
    clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
                               rdev->cores[core].clks);
    ...
}

The driver registers a shared interrupt but fails to call synchronize_irq().
Disabling the clocks while the IRQ handler is executing concurrently on another
CPU could cause an AXI bus fault or a hard system crash.

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=4

Reply via email to