Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
New issues:
- [High] The IOMMU hardware registers are left uninitialized after a job timeout and hardware reset, causing subsequent jobs to immediately fault.

Pre-existing issues:
- [High] Incorrect teardown ordering in `rocket_core_fini()` destroys IOMMU and PM resources before stopping the DRM scheduler, leading to data races, use-after-free, and potential hardware faults.
- [High] PM runtime usage counts and unsignaled fences are leaked on error paths in `rocket_job_run()`.
--

commit 49c877f1ba7baae4a28268850b55e2ac3ab4a8cc
Author: Midgy BALON <[email protected]>

accel: rocket: Keep the IOMMU domain attached across jobs

rocket attached the job's IOMMU domain in rocket_job_run() and detached it
again on every completion and reset. Each attach/detach toggles the
rk_iommu stall/force-reset/paging handshake, and on RK3568 the NPU MMU is
idle between jobs, so that handshake times out and logs a burst of
"stall/paging request timed out" errors for every job.

Attach the per-context domain once and keep it: track the attached domain
in the core, swap it only when a job from a different context runs, and
detach it at core teardown. A reference on the attached domain is held so
it outlives the job that first attached it and is released on
swap/teardown.

> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> index a8de876365873..634f78dfe2887 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c

[ ... ]

> @@ -112,6 +113,11 @@ void rocket_core_fini(struct rocket_core *core)
>  {
>  	pm_runtime_dont_use_autosuspend(core->dev);
>  	pm_runtime_disable(core->dev);
> +	if (core->attached_domain) {
> +		iommu_detach_group(NULL, core->iommu_group);
> +		rocket_iommu_domain_put(core->attached_domain);
> +		core->attached_domain = NULL;
> +	}
>  	iommu_group_put(core->iommu_group);
>  	core->iommu_group = NULL;

[Severity: High]
This is a pre-existing issue, but IOMMU resources are being torn down before
the DRM scheduler is stopped in rocket_job_fini(core).

Because the scheduler might still be active, couldn't an in-flight job be
executing concurrently? This could lead to a data race on
core->attached_domain, the IOMMU domain being forcefully unmapped from a
running job causing AXI bus faults, or a use-after-free when
rocket_job_run() accesses core->iommu_group.

Should rocket_job_fini(core) be called before tearing down the IOMMU state
to ensure the hardware is idle and the scheduler is stopped?

>  	rocket_job_fini(core);
>  }

[ ... ]

> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> index e25234261536b..b248371be8a1e 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c

[ ... ]

> @@ -314,9 +315,26 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
>  	if (ret < 0)
>  		return fence;

[Severity: High]
This isn't a bug introduced by this patch, but if pm_runtime_get_sync()
fails, it returns a negative error code but still increments the PM usage
counter.

The error path returns without calling pm_runtime_put_noidle() or signaling
the fence, which leaks the PM counter and causes the scheduler to wait
indefinitely.

> -	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -	if (ret < 0)
> -		return fence;
> +	/*
> +	 * Attach the job's IOMMU domain only when it differs from the one
> +	 * already attached. Re-attaching per job toggles the rk_iommu
> +	 * stall/reset handshake on an idle NPU MMU, which is slow and
> +	 * noisy; keep the domain attached across jobs instead.
> +	 */
> +	if (core->attached_domain != job->domain) {
> +		if (core->attached_domain) {
> +			iommu_detach_group(NULL, core->iommu_group);
> +			rocket_iommu_domain_put(core->attached_domain);
> +			core->attached_domain = NULL;
> +		}
> +
> +		ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> +		if (ret < 0)
> +			return fence;

[Severity: High]
Also a pre-existing issue, but if iommu_attach_group() fails, it returns the
fence early without dropping the successfully acquired PM reference.

Should this path include a pm_runtime_put_noidle() call to avoid leaking
the reference?

> +
> +		kref_get(&job->domain->kref);
> +		core->attached_domain = job->domain;
> +	}
>
>  	scoped_guard(mutex, &core->job_lock) {
>  		core->in_flight_job = job;

[ ... ]

> @@ -376,8 +393,6 @@ rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)
>  	 */
>  	rocket_core_reset(core);
>
> -	iommu_detach_group(NULL, core->iommu_group);
> -
>  	/* NPU has been reset, we can clear the reset pending bit. */
>  	atomic_set(&core->reset.pending, 0);

[Severity: High]
When the NPU is hard-reset by rocket_core_reset(), the IOMMU's hardware
registers (like page table base addresses) are wiped.

Since core->attached_domain is not cleared here, when the scheduler
restarts and executes the next job, rocket_job_run() will see
core->attached_domain matches the job's domain and skip calling
iommu_attach_group().

Will this leave the IOMMU hardware unconfigured and cause subsequent jobs
to immediately throw DMA faults? Should core->attached_domain be cleared
here to force a re-attach and reprogramming of the IOMMU?

--
Sashiko AI review · https://sashiko.dev/#/patchset/[email protected]?part=5
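
For reference, the two rocket_job_run() error-path comments and the reset comment above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not driver code: every `model_*` name is hypothetical, the PM usage count, the domain refcount and the MMU register state are reduced to plain fields, and fence signaling is left out entirely. It shows one possible shape for the unwinding (drop the PM count on both failure paths) and for the reset path (forget the cached domain so the next job re-attaches).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these exist in the driver. */
struct model_domain {
	int refs;			/* models the domain kref */
};

struct model_core {
	int pm_usage;			/* models the runtime-PM usage count */
	bool mmu_programmed;		/* models the IOMMU page-table registers */
	struct model_domain *attached;	/* models core->attached_domain */
};

/* Like pm_runtime_get_sync(): the count is bumped even when an error is returned. */
static int model_pm_get_sync(struct model_core *c, int err)
{
	c->pm_usage++;
	return err;
}

static void model_pm_put_noidle(struct model_core *c)
{
	c->pm_usage--;
}

/* Models a domain attach that may fail; success programs the MMU. */
static int model_attach(struct model_core *c, int err)
{
	if (err)
		return err;
	c->mmu_programmed = true;
	return 0;
}

/* Detach and drop the reference held on the cached domain, if any. */
static void model_drop_attached(struct model_core *c)
{
	if (!c->attached)
		return;
	c->attached->refs--;
	c->attached = NULL;
	c->mmu_programmed = false;
}

/* Returns 0 when the job starts; both failure paths drop the PM count. */
static int model_job_run(struct model_core *c, struct model_domain *d,
			 int pm_err, int attach_err)
{
	int ret = model_pm_get_sync(c, pm_err);

	if (ret < 0)
		goto err_pm;

	if (c->attached != d) {
		model_drop_attached(c);
		ret = model_attach(c, attach_err);
		if (ret < 0)
			goto err_pm;
		d->refs++;
		c->attached = d;
	}
	return 0;

err_pm:
	model_pm_put_noidle(c);
	return ret;
}

/* Job completion releases the PM count taken in model_job_run(). */
static void model_job_done(struct model_core *c)
{
	c->pm_usage--;
}

/* A hard reset wipes the MMU state, so the cached domain is forgotten too. */
static void model_reset(struct model_core *c)
{
	c->mmu_programmed = false;
	model_drop_attached(c);
}
```

In this model a job that runs after model_reset() always takes the attach path again, which is the behaviour the last comment asks about. Whether the real driver should clear core->attached_domain in rocket_reset() or reprogram the IOMMU some other way is for the author to decide.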
