Thank you for your contribution! Sashiko AI review found 3 potential issue(s) 
to consider:

New issues:
- [High] rocket_core_fini() tears down the IOMMU domain without quiescing the 
hardware, causing IOMMU faults and wedged hardware.

Pre-existing issues:
- [High] Error paths in rocket_job_run() leak the PM runtime usage counter and 
return an unsignaled fence, causing permanent power drain and spurious timeouts.
- [High] Job timeout recovery leaves the NPU permanently powered on due to 
pm_runtime_put_noidle().
--

commit 629253998abe2060bd2b43f469323132b5743d36
Author: Midgy BALON <[email protected]>
Subject: accel: rocket: Keep the IOMMU domain attached across jobs

This commit attaches the IOMMU domain once per context instead of toggling
it on every job to avoid hardware stall timeouts. The domain is released on
context swap, core reset, or core teardown.

> diff --git a/drivers/accel/rocket/rocket_core.c 
> b/drivers/accel/rocket/rocket_core.c
> index 779e951596a15..6c128f585cff4 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c
[ ... ]
> @@ -112,9 +113,20 @@ void rocket_core_fini(struct rocket_core *core)
>  {
>       pm_runtime_dont_use_autosuspend(core->dev);
>       pm_runtime_disable(core->dev);
> +
> +     /*
> +      * Stop the scheduler before tearing down the IOMMU so an in-flight
> +      * job can no longer touch the (about to be detached) domain.
> +      */
> +     rocket_job_fini(core);
> +
> +     if (core->attached_domain) {
> +             iommu_detach_group(NULL, core->iommu_group);

[Severity: High]
Does this detach the IOMMU page tables while the hardware is still active?

Calling drm_sched_fini() inside rocket_job_fini() only stops the software
scheduling thread, but it doesn't wait for the currently executing job on
the hardware to complete or reset the AXI master. Detaching the group while
the NPU is still executing could cause memory translation faults and wedge
the hardware.

Should the hardware be explicitly quiesced, for example by calling
rocket_core_reset(), before the IOMMU is detached here?

> +             rocket_iommu_domain_put(core->attached_domain);
> +             core->attached_domain = NULL;
> +     }
>       iommu_group_put(core->iommu_group);
>       core->iommu_group = NULL;
> -     rocket_job_fini(core);
>  }
>  
>  void rocket_core_reset(struct rocket_core *core)
> diff --git a/drivers/accel/rocket/rocket_job.c 
> b/drivers/accel/rocket/rocket_job.c
> index e25234261536b..368b2ebead1b3 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
[ ... ]
> @@ -314,9 +315,26 @@ static struct dma_fence *rocket_job_run(struct 
> drm_sched_job *sched_job)
>       if (ret < 0)
>               return fence;
>  
> -     ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -     if (ret < 0)
> -             return fence;

[Severity: High]
This isn't a bug introduced by this patch, but does the pm_runtime_get_sync
error path just above this leak the PM runtime usage counter? 

In rocket_job_run(), the code does:

        ret = pm_runtime_get_sync(core->dev);
        if (ret < 0)
                return fence;

If pm_runtime_get_sync() fails, returning without calling
pm_runtime_put_noidle() leaves the counter elevated.

> +     /*
> +      * Attach the job's IOMMU domain only when it differs from the one
> +      * already attached. Re-attaching per job toggles the rk_iommu
> +      * stall/reset handshake on an idle NPU MMU, which is slow and
> +      * noisy; keep the domain attached across jobs instead.
> +      */
> +     if (core->attached_domain != job->domain) {
> +             if (core->attached_domain) {
> +                     iommu_detach_group(NULL, core->iommu_group);
> +                     rocket_iommu_domain_put(core->attached_domain);
> +                     core->attached_domain = NULL;
> +             }
> +
> +             ret = iommu_attach_group(job->domain->domain, 
> core->iommu_group);
> +             if (ret < 0)
> +                     return fence;

[Severity: High]
This is a pre-existing issue, but does this error path leak the PM runtime
reference acquired earlier in this function? 

Since the in_flight_job isn't set yet when returning the fence here, the
subsequent timeout handler won't clean this up, preventing the device from
ever entering low power states.

> +
> +             kref_get(&job->domain->kref);
> +             core->attached_domain = job->domain;
> +     }
>  
>       scoped_guard(mutex, &core->job_lock) {
>               core->in_flight_job = job;
[ ... ]
> @@ -376,7 +393,15 @@ rocket_reset(struct rocket_core *core, struct 
> drm_sched_job *bad)

[Severity: High]
This isn't a bug introduced by this patch, but does job timeout recovery
leave the NPU permanently powered on?

Earlier in rocket_reset(), the usage counter is decremented:

        scoped_guard(mutex, &core->job_lock) {
                if (core->in_flight_job)
                        pm_runtime_put_noidle(core->dev);

                core->in_flight_job = NULL;
        }

Because pm_runtime_put_noidle() does not invoke an idle check or queue an
auto-suspend, doesn't the device remain in the RPM_ACTIVE state and stay
physically powered on indefinitely? Should pm_runtime_put_autosuspend() be
used here instead?

>        */
>       rocket_core_reset(core);
>  
> -     iommu_detach_group(NULL, core->iommu_group);
> +     /*
> +      * The reset wipes the IOMMU page-table base, so drop the attached
> +      * domain to force the next job to re-attach and reprogram it.
> +      */
> +     if (core->attached_domain) {
> +             iommu_detach_group(NULL, core->iommu_group);
> +             rocket_iommu_domain_put(core->attached_domain);
> +             core->attached_domain = NULL;
> +     }
>  
>       /* NPU has been reset, we can clear the reset pending bit. */
>       atomic_set(&core->reset.pending, 0);

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=5

Reply via email to