Thank you for your contribution! Sashiko AI review found 3 potential issue(s) 
to consider:

New issues:
- [High] The IOMMU hardware registers are left uninitialized after a job 
timeout and hardware reset, causing subsequent jobs to immediately fault.

Pre-existing issues:
- [High] Incorrect teardown ordering in `rocket_core_fini()` destroys IOMMU and 
PM resources before stopping the DRM scheduler, leading to data races, 
use-after-free, and potential hardware faults.
- [High] PM runtime usage counts and unsignaled fences are leaked on error 
paths in `rocket_job_run()`.
--

commit 49c877f1ba7baae4a28268850b55e2ac3ab4a8cc
Author: Midgy BALON <[email protected]>

accel: rocket: Keep the IOMMU domain attached across jobs
    
rocket attached the job's IOMMU domain in rocket_job_run() and
detached it again on every completion and reset. Each attach/detach
toggles the rk_iommu stall/force-reset/paging handshake, and on
RK3568 the NPU MMU is idle between jobs, so that handshake times out
and logs a burst of "stall/paging request timed out" errors for
every job.

Attach the per-context domain once and keep it: track the attached
domain in the core, swap it only when a job from a different context
runs, and detach it at core teardown. A reference on the attached
domain is held so it outlives the job that first attached it and is
released on swap/teardown.

> diff --git a/drivers/accel/rocket/rocket_core.c 
> b/drivers/accel/rocket/rocket_core.c
> index a8de876365873..634f78dfe2887 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c

[ ... ]

> @@ -112,6 +113,11 @@ void rocket_core_fini(struct rocket_core *core)
>  {
>       pm_runtime_dont_use_autosuspend(core->dev);
>       pm_runtime_disable(core->dev);
> +     if (core->attached_domain) {
> +             iommu_detach_group(NULL, core->iommu_group);
> +             rocket_iommu_domain_put(core->attached_domain);
> +             core->attached_domain = NULL;
> +     }
>       iommu_group_put(core->iommu_group);
>       core->iommu_group = NULL;

[Severity: High]
This is a pre-existing issue, but IOMMU resources are being torn down before
the DRM scheduler is stopped in rocket_job_fini(core).

Because the scheduler might still be active, couldn't an in-flight job be
executing concurrently? This could lead to a data race on core->attached_domain,
the IOMMU domain being forcefully unmapped from a running job causing
AXI bus faults, or a use-after-free when rocket_job_run() accesses
core->iommu_group.

Should rocket_job_fini(core) be called before tearing down the IOMMU state to
ensure the hardware is idle and the scheduler is stopped?

>       rocket_job_fini(core);
>  }

[ ... ]

> diff --git a/drivers/accel/rocket/rocket_job.c 
> b/drivers/accel/rocket/rocket_job.c
> index e25234261536b..b248371be8a1e 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c

[ ... ]

> @@ -314,9 +315,26 @@ static struct dma_fence *rocket_job_run(struct 
> drm_sched_job *sched_job)
>       if (ret < 0)
>               return fence;

[Severity: High]
This isn't a bug introduced by this patch, but if pm_runtime_get_sync()
fails, it returns a negative error code but still increments the PM usage 
counter.

The error path returns without calling pm_runtime_put_noidle() or signaling
the fence, which leaks the PM counter and causes the scheduler to wait 
indefinitely.

> -     ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -     if (ret < 0)
> -             return fence;
> +     /*
> +      * Attach the job's IOMMU domain only when it differs from the one
> +      * already attached. Re-attaching per job toggles the rk_iommu
> +      * stall/reset handshake on an idle NPU MMU, which is slow and
> +      * noisy; keep the domain attached across jobs instead.
> +      */
> +     if (core->attached_domain != job->domain) {
> +             if (core->attached_domain) {
> +                     iommu_detach_group(NULL, core->iommu_group);
> +                     rocket_iommu_domain_put(core->attached_domain);
> +                     core->attached_domain = NULL;
> +             }
> +
> +             ret = iommu_attach_group(job->domain->domain, 
> core->iommu_group);
> +             if (ret < 0)
> +                     return fence;

[Severity: High]
Also a pre-existing issue, but if iommu_attach_group() fails, it returns
the fence early without dropping the successfully acquired PM reference.
Should this path include a pm_runtime_put_noidle() call to avoid leaking
the reference?

> +
> +             kref_get(&job->domain->kref);
> +             core->attached_domain = job->domain;
> +     }
>  
>       scoped_guard(mutex, &core->job_lock) {
>               core->in_flight_job = job;

[ ... ]

> @@ -376,8 +393,6 @@ rocket_reset(struct rocket_core *core, struct 
> drm_sched_job *bad)
>        */
>       rocket_core_reset(core);
>  
> -     iommu_detach_group(NULL, core->iommu_group);
> -
>       /* NPU has been reset, we can clear the reset pending bit. */
>       atomic_set(&core->reset.pending, 0);

[Severity: High]
When the NPU is hard-reset by rocket_core_reset(), the IOMMU's hardware 
registers
(like page table base addresses) are wiped.

Since core->attached_domain is not cleared here, when the scheduler restarts
and executes the next job, rocket_job_run() will see core->attached_domain
matches the job's domain and skip calling iommu_attach_group().

Will this leave the IOMMU hardware unconfigured and cause subsequent jobs to
immediately throw DMA faults? Should core->attached_domain be cleared here
to force a re-attach and reprogramming of the IOMMU?

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=5

Reply via email to