Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
Pre-existing issues: - [Critical] Double free of `rjob->tasks` when `rocket_copy_tasks` fails. - [High] The shared IRQ handler accesses hardware registers without checking if the device is active via pm_runtime_get_if_active(). - [High] The runtime suspend callback powers down the hardware without synchronizing the shared IRQ. - [High] Timeout recovery drops the PM usage count early via `pm_runtime_put_noidle()`, stranding the device in an active power state and introducing a race condition. - [High] PM runtime usage counter leaks on error paths during job submission. - [High] Driver removal cancels pending PM transitions without synchronous suspend. - [High] Missing NULL check on `iommu_group_get()` leads to crash if device lacks an IOMMU. -- commit 54a038a17307496e6d47b23eab270cebaf07629e Author: Midgy BALON <[email protected]> accel: rocket: Reset the NPU before detaching the IOMMU on timeout This commit fixes a timeout issue where detaching the IOMMU before resetting the wedged NPU AXI master causes a stall. By asserting the core reset first, the AXI master is quiesced and the IOMMU detach completes cleanly. > diff --git a/drivers/accel/rocket/rocket_job.c > b/drivers/accel/rocket/rocket_job.c > index ac51bff39833f..e25234261536b 100644 > --- a/drivers/accel/rocket/rocket_job.c > +++ b/drivers/accel/rocket/rocket_job.c > @@ -364,14 +364,20 @@ rocket_reset(struct rocket_core *core, struct > drm_sched_job *bad) > if (core->in_flight_job) > pm_runtime_put_noidle(core->dev); [Severity: High] This is a pre-existing issue, but does using pm_runtime_put_noidle() here leave the device stranded in an active power state? Without restarting the autosuspend timer or scheduling an idle callback, the usage counter is dropped while leaving the device active. Could this also create a race condition if an idle check is triggered concurrently (such as via a sysfs write), causing the device to begin suspending while the subsequent rocket_core_reset() is toggling reset lines? > > - iommu_detach_group(NULL, core->iommu_group); > - > core->in_flight_job = NULL; > } > > - /* Proceed with reset now. */ > + /* > + * Reset the NPU hardware before detaching the IOMMU. A timed-out job > + * leaves the NPU AXI master wedged; detaching the IOMMU then issues a > + * stall request that never drains and times out (warning in the IOMMU > + * core). Asserting the core reset first quiesces the master so the > + * detach completes cleanly. > + */ > rocket_core_reset(core); > > + iommu_detach_group(NULL, core->iommu_group); > + > /* NPU has been reset, we can clear the reset pending bit. */ > atomic_set(&core->reset.pending, 0); > [Severity: High] This isn't a bug introduced by this patch, but looking at the shared IRQ handler: drivers/accel/rocket/rocket_job.c:rocket_job_irq_handler() { ... u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS); ... } Is it possible that this accesses hardware registers without verifying if the device is active? If the NPU is suspended and another device triggers the shared interrupt line, could this read from powered-off hardware and return invalid data (like 0xffffffff)? Might this be misinterpreted as a valid interrupt, leading to spurious threaded IRQ execution and potential bus crashes? Is it necessary to use pm_runtime_get_if_active() to verify the power state first? [Severity: High] This isn't a bug introduced by this patch, but does the runtime suspend callback power down hardware without synchronizing the shared IRQ? drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() { ... clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks), rdev->cores[core].clks); ... } Since the driver uses a shared IRQ (IRQF_SHARED), could the handler execute mid-flight on another CPU while the hardware is actively being shut down? Should this call synchronize_irq() before disabling hardware clocks to prevent bus errors or system crashes? [Severity: High] This is a pre-existing issue, but does rocket_job_run() leak the PM usage counter on error paths? drivers/accel/rocket/rocket_job.c:rocket_job_run() { ... ret = pm_runtime_get_sync(core->dev); if (ret < 0) return fence; ... ret = iommu_attach_group(job->domain->domain, core->iommu_group); if (ret < 0) return fence; ... } If either pm_runtime_get_sync() or iommu_attach_group() fails, would returning an unsignaled fence early without balancing the usage counter via pm_runtime_put_noidle() permanently leak the PM reference? Might this prevent the device from ever entering runtime suspend, while also leaving the DRM scheduler stalled on a dead fence until it times out? [Severity: High] This is a pre-existing issue, but does driver removal cancel pending PM transitions without a synchronous suspend? drivers/accel/rocket/rocket_core.c:rocket_core_fini() { ... pm_runtime_disable(core->dev); ... } Because pm_runtime_disable() cancels pending asynchronous transitions (such as the autosuspend timer), wouldn't the NPU hardware clocks be left running indefinitely after the driver is removed and the device is unbound? Is a call to pm_runtime_put_sync() needed before disabling runtime PM? [Severity: High] This isn't a bug introduced by this patch, but is there a missing NULL check when fetching the IOMMU group? drivers/accel/rocket/rocket_core.c:rocket_core_init() { ... core->iommu_group = iommu_group_get(dev); ... } If the device lacks an IOMMU (for example, if it is disabled by the user or failed to probe), will iommu_group_get() return NULL? If so, when rocket_job_run() later executes: drivers/accel/rocket/rocket_job.c:rocket_job_run() { ... ret = iommu_attach_group(job->domain->domain, core->iommu_group); ... } Would this unconditionally dereference the group pointer and cause a kernel panic on the first job submission? [Severity: Critical] This is a pre-existing issue, but does a failure in rocket_copy_tasks() lead to a double free? drivers/accel/rocket/rocket_job.c:rocket_copy_tasks() { ... fail: kvfree(rjob->tasks); return ret; } If this fails (such as from an invalid pointer during copy_from_user()), the tasks array is freed, but is rjob->tasks left as a dangling pointer? When the caller jumps to out_cleanup_job and invokes rocket_job_put(rjob), it reaches: drivers/accel/rocket/rocket_job.c:rocket_job_cleanup() { ... kvfree(job->tasks); ... } Will this result in freeing the dangling pointer a second time? Can this double free corrupt kernel memory or cause a panic, potentially being exploitable by an unprivileged user? -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=4
