Thank you for your contribution! Sashiko AI review found 4 potential issue(s) 
to consider:
- [High] The Address Space reference (mmu->as_count) leaks permanently when a 
hardware reset aborts perfcnt_enable_locked.
- [High] A TOCTOU race between disable_locked and the asynchronous reset 
workqueue reintroduces the mmu->as_count underflow WARN.
- [Medium] A positive error code (EAGAIN) is returned directly to userspace 
instead of the standard negative one (-EAGAIN).
- [Medium] Spurious initialization failures occur for new perfcnt sessions due 
to a stale hw_reset_happened flag.
--

commit 555dc068f79668462a998f3ee4fa95ead697090d
Author: Adrián Larumbe <[email protected]>

drm/panfrost: Make reset sequence deal with an active HWPerf session

This commit introduces a perfcnt HW reset path to handle active HWPerf
sessions during a HW reset. It renders perfcnt unusable after reset and
notifies the user to take recovery measures.

Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/88
Fixes: 7786fd108777 ("drm/panfrost: Expose performance counters through unstable ioctls")

> diff --git a/drivers/gpu/drm/panfrost/panfrost_perfcnt.c b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c
> index ad1156678e91c..c2087ea705fe6 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_perfcnt.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c

[ ... ]

> @@ -173,6 +187,16 @@ static int panfrost_perfcnt_enable_locked(struct panfrost_device *pfdev,
>       if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
>               gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
>  
> +     /* If a reset happened, we've no way of knowing whether it was between the time we called
> +      * panfrost_mmu_as_get() or before perfcnt_enable(), so clearing this flag and going forward
> +      * isn't possible. We must clear the flag and try again in the hopes no resets will happen
> +      * between this and the next ioctl invocation.
> +      */
> +     if (atomic_cmpxchg(&perfcnt->hw_reset_happened, 1, 0)) {
> +             ret = EAGAIN;

[Severity: Medium]
Does this return a positive error code directly to userspace instead of the
standard negative one (-EAGAIN)?

DRM ioctls pass positive return values straight back to user space as the
ioctl syscall result. User space would therefore treat the call as successful,
then fail on later dump ioctls because the session was never actually enabled.
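
To make the sign issue concrete, here is a minimal userspace-only sketch. The
names fake_enable_positive(), fake_enable_negative() and caller_sees_failure()
are hypothetical stand-ins and not panfrost functions. The only thing carried
over is the usual convention that a caller treats a negative return as failure.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a handler that returns a positive errno value. */
static int fake_enable_positive(void)
{
	return EAGAIN;
}

/* Hypothetical stand-in for a handler that follows the negative-errno convention. */
static int fake_enable_negative(void)
{
	return -EAGAIN;
}

/* Models a caller that only treats negative return values as errors. */
static bool caller_sees_failure(int ret)
{
	return ret < 0;
}
```

In this model only the negative variant is recognised as a failure, so the
positive variant would pass for success.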

> +             goto err_disable;

[Severity: High]
Can this leak the Address Space reference if a hardware reset aborts the
enable sequence?

If a GPU reset occurs before panfrost_mmu_as_get() is called, the reset
handler clears mmu->as_count to 0, and then panfrost_mmu_as_get() increments
it to 1. The code subsequently detects the reset via the hw_reset_happened
flag, clears it, and jumps to err_disable here. Because err_disable skips
panfrost_mmu_as_put(), the AS slot reference is permanently leaked,
eventually exhausting the limited hardware AS slots.
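
One possible shape for the unwind, sketched as a standalone model rather than
as a patch: struct fake_mmu, fake_as_get(), fake_as_put(), fake_enable() and
the err_as_put label are all hypothetical, and whether a dedicated label fits
the real error path is an assumption that would need checking against the
driver.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the AS reference count; not the real panfrost MMU. */
struct fake_mmu {
	int as_count;
};

static void fake_as_get(struct fake_mmu *mmu)
{
	mmu->as_count++;
}

static void fake_as_put(struct fake_mmu *mmu)
{
	mmu->as_count--;
}

/* Returns 0 on success, -EAGAIN if a reset was observed after the AS get. */
static int fake_enable(struct fake_mmu *mmu, bool reset_seen)
{
	int ret;

	fake_as_get(mmu);

	if (reset_seen) {
		ret = -EAGAIN;
		goto err_as_put;
	}

	/* Success: the caller keeps the reference until disable. */
	return 0;

err_as_put:
	/* Drop the reference taken above before returning the error. */
	fake_as_put(mmu);
	return ret;
}
```

In this model the count returns to its starting value on the aborted path and
stays elevated only when enable succeeds.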

> +     }
> +
>       /* The BO ref is retained by the mapping. */
>       drm_gem_object_put(&bo->base);

[ ... ]

> @@ -209,7 +235,8 @@ static int panfrost_perfcnt_disable_locked(struct panfrost_device *pfdev,
>       drm_gem_vunmap(&perfcnt->mapping->obj->base.base, &map);
>       perfcnt->buf = NULL;
>       panfrost_gem_close(&perfcnt->mapping->obj->base.base, file_priv);
> -     panfrost_mmu_as_put(pfdev, perfcnt->mapping->mmu);
> +     if (!atomic_read(&perfcnt->hw_reset_happened))
> +             panfrost_mmu_as_put(pfdev, perfcnt->mapping->mmu);

[Severity: High]
Could a race between this check and an asynchronous reset reintroduce the
mmu->as_count underflow WARN that this patch was explicitly intended to
prevent?

The non-atomic gap between reading the flag and executing
panfrost_mmu_as_put() creates an exposure window. If an interleaved
asynchronous hardware reset occurs in this window, panfrost_device_reset()
will clear mmu->as_count to 0. The subsequent execution of
panfrost_mmu_as_put() decrements it to -1, directly triggering the WARN.
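
The interleaving can be replayed deterministically in a single-threaded model.
Everything here (struct fake_state, fake_reset(), fake_disable(), the
reset_in_window switch) is hypothetical and only mimics the ordering described
above. It does not use real kernel atomics or workqueues.

```c
#include <stdbool.h>

/* Hypothetical model of the two pieces of state involved in the race. */
struct fake_state {
	bool reset_happened;
	int as_count;
};

/* Models the reset path: clears the count and raises the flag. */
static void fake_reset(struct fake_state *s)
{
	s->as_count = 0;
	s->reset_happened = true;
}

/*
 * Models the patched disable path. When reset_in_window is true, the
 * reset is replayed between the flag check and the put.
 */
static void fake_disable(struct fake_state *s, bool reset_in_window)
{
	bool skip_put = s->reset_happened;	/* time of check */

	if (reset_in_window)
		fake_reset(s);			/* reset lands in the gap */

	if (!skip_put)
		s->as_count--;			/* time of use */
}
```

With the reset replayed inside the window, the model ends with as_count at -1.
Without it, the count drops cleanly to 0. This suggests the check and the put
would need to be serialized against the reset path in some way.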

>       panfrost_gem_mapping_put(perfcnt->mapping);
>       perfcnt->mapping = NULL;
>       pm_runtime_put_autosuspend(pfdev->base.dev);

[ ... ]

> @@ -346,3 +373,20 @@ void panfrost_perfcnt_fini(struct panfrost_device *pfdev)
>       /* Disable everything before leaving. */
>       panfrost_perfcnt_gpu_disable(pfdev);
>  }
> +
> +void panfrost_perfcnt_reset(struct panfrost_device *pfdev)
> +{
> +     struct panfrost_perfcnt *perfcnt = pfdev->perfcnt;
> +
> +     /* Since this function will be called either from a scheduled HW reset
> +      * or a runtime resume, tearing down any perfcnt resources means we're
> +      * doomed to deadlocking with perfcnt_{enable/disable}, since we'd have
> +      * to take the perfecnt lock. On top of that, it'd also violate DMA fence
> +      * signalling rules because GFP_KERNEL allocations are made with the perfcnt
> +      * lock taken in perfcnt_enable. In light of this, the only thing we can do
> +      * is disabling perfcnt unconditionally, and notifying the perfcnt user of
> +      * the reset having happpened so that they can take recovery measures.
> +      */
> +     panfrost_perfcnt_gpu_disable(pfdev);
> +     atomic_set(&perfcnt->hw_reset_happened, 1);

[Severity: Medium]
Will this unconditionally set the reset flag even if no perfcnt session is
active, causing the next new session to spuriously fail?

Since this function is called during any GPU reset, it sets the flag even
when perfcnt->user == NULL. If user space later attempts to start a
completely new session, panfrost_perfcnt_enable_locked() will consume this
stale flag and abort with EAGAIN, functionally breaking the feature on the
first attempt after a reset.
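
A sketch of gating the flag on an active session, again as a standalone model:
struct fake_perfcnt, fake_perfcnt_reset() and fake_enable_would_retry() are
hypothetical names. Reading the session pointer from the reset path without
the perfcnt lock may itself be racy in the real driver, so this only
illustrates the gating idea.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the perfcnt state touched by the reset path. */
struct fake_perfcnt {
	void *user;			/* NULL when no session is active */
	bool hw_reset_happened;
};

/* Only flag the reset when a session exists to be notified. */
static void fake_perfcnt_reset(struct fake_perfcnt *p)
{
	if (p->user)
		p->hw_reset_happened = true;
}

/*
 * Models the check in the enable path: reports whether a pending flag
 * would force a retry, and consumes the flag either way.
 */
static bool fake_enable_would_retry(struct fake_perfcnt *p)
{
	bool pending = p->hw_reset_happened;

	p->hw_reset_happened = false;
	return pending;
}
```

In this model a reset with no session leaves the flag clear, so the next
enable attempt is not refused.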

> +}

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=5
