Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] Address Space reference (mmu->as_count) permanently leaks when a hardware reset aborts panfrost_perfcnt_enable_locked().
- [High] A TOCTOU race between panfrost_perfcnt_disable_locked() and the asynchronous reset workqueue reintroduces the mmu->as_count underflow WARN.
- [Medium] A positive error code (EAGAIN) is incorrectly returned directly to userspace instead of the standard negative one (-EAGAIN).
- [Medium] Spurious initialization failures occur for new perfcnt sessions due to a stale hw_reset_happened flag.
--
commit 555dc068f79668462a998f3ee4fa95ead697090d
Author: Adrián Larumbe <[email protected]>

drm/panfrost: Make reset sequence deal with an active HWPerf session

This commit introduces a perfcnt HW reset path to handle active HWPerf
sessions during a HW reset. It renders perfcnt unusable after reset and
notifies the user to take recovery measures.

Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/88
Fixes: 7786fd108777 ("drm/panfrost: Expose performance counters through unstable ioctls")

> diff --git a/drivers/gpu/drm/panfrost/panfrost_perfcnt.c b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c
> index ad1156678e91c..c2087ea705fe6 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_perfcnt.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c

[ ... ]

> @@ -173,6 +187,16 @@ static int panfrost_perfcnt_enable_locked(struct panfrost_device *pfdev,
>  	if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
>  		gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
>
> +	/* If a reset happened, we've no way of knowing whether it was between the time we called
> +	 * panfrost_mmu_as_get() or before perfcnt_enable(), so clearing this flag and going forward
> +	 * isn't possible. We must clear the flag and try again in the hopes no resets will happen
> +	 * between this and the next ioctl invocation.
> +	 */
> +	if (atomic_cmpxchg(&perfcnt->hw_reset_happened, 1, 0)) {
> +		ret = EAGAIN;

[Severity: Medium]
Does this return a positive error code directly to userspace instead of the
standard negative one (-EAGAIN)?

DRM ioctls pass positive return values directly back to user space as the
ioctl syscall result. Because of this, user space will incorrectly interpret
this as success and subsequently fail on dump ioctls since the session
wasn't actually enabled.

> +		goto err_disable;

[Severity: High]
Can this leak the Address Space reference if a hardware reset aborts the
enable sequence?
If a GPU reset occurs before panfrost_mmu_as_get() is called, the reset
handler clears mmu->as_count to 0, and then panfrost_mmu_as_get() increments
it to 1. The code subsequently detects the reset via the hw_reset_happened
flag, clears it, and jumps to err_disable here. Because err_disable skips
panfrost_mmu_as_put(), the AS slot reference is permanently leaked,
eventually exhausting the limited hardware AS slots.

> +	}
> +
>  	/* The BO ref is retained by the mapping. */
>  	drm_gem_object_put(&bo->base);

[ ... ]

> @@ -209,7 +235,8 @@ static int panfrost_perfcnt_disable_locked(struct panfrost_device *pfdev,
>  	drm_gem_vunmap(&perfcnt->mapping->obj->base.base, &map);
>  	perfcnt->buf = NULL;
>  	panfrost_gem_close(&perfcnt->mapping->obj->base.base, file_priv);
> -	panfrost_mmu_as_put(pfdev, perfcnt->mapping->mmu);
> +	if (!atomic_read(&perfcnt->hw_reset_happened))
> +		panfrost_mmu_as_put(pfdev, perfcnt->mapping->mmu);

[Severity: High]
Could a race between this check and an asynchronous reset reintroduce the
mmu->as_count underflow WARN that this patch was explicitly intended to
prevent?

The non-atomic gap between reading the flag and executing
panfrost_mmu_as_put() creates an exposure window. If an interleaved
asynchronous hardware reset occurs in this window, panfrost_device_reset()
will clear mmu->as_count to 0. The subsequent execution of
panfrost_mmu_as_put() decrements it to -1, directly triggering the WARN.

>  	panfrost_gem_mapping_put(perfcnt->mapping);
>  	perfcnt->mapping = NULL;
>  	pm_runtime_put_autosuspend(pfdev->base.dev);

[ ... ]

> @@ -346,3 +373,20 @@ void panfrost_perfcnt_fini(struct panfrost_device *pfdev)
>  	/* Disable everything before leaving. */
>  	panfrost_perfcnt_gpu_disable(pfdev);
>  }
> +
> +void panfrost_perfcnt_reset(struct panfrost_device *pfdev)
> +{
> +	struct panfrost_perfcnt *perfcnt = pfdev->perfcnt;
> +
> +	/* Since this function will be called either from a scheduled HW reset
> +	 * or a runtime resume, tearing down any perfcnt resources means we're
> +	 * doomed to deadlocking with perfcnt_{enable/disable}, since we'd have
> +	 * to take the perfecnt lock. On top of that, it'd also violate DMA fence
> +	 * signalling rules because GFP_KERNEL allocations are made with the perfcnt
> +	 * lock taken in perfcnt_enable. In light of this, the only thing we can do
> +	 * is disabling perfcnt unconditionally, and notifying the perfcnt user of
> +	 * the reset having happpened so that they can take recovery measures.
> +	 */
> +	panfrost_perfcnt_gpu_disable(pfdev);
> +	atomic_set(&perfcnt->hw_reset_happened, 1);

[Severity: Medium]
Will this unconditionally set the reset flag even if no perfcnt session is
active, causing the next new session to fail spuriously?

Since this function is called during any GPU reset, it sets the flag even
when perfcnt->user == NULL. If user space later attempts to start a
completely new session, panfrost_perfcnt_enable_locked() will consume this
stale flag and abort with EAGAIN, functionally breaking the feature on the
first attempt after a reset.

> +}

--
Sashiko AI review · https://sashiko.dev/#/patchset/[email protected]?part=5
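All four issues revolve around one question: who still owns the perfcnt AS reference after a reset? One possible ordering that avoids all four is sketched below. It is a single-threaded user-space model, not driver code. The names model_dev, model_reset(), model_enable(), model_disable() and the as_lock spinlock are hypothetical. The model assumes a lock that both the reset path and the perfcnt paths may take, and it covers only the single reference held by the perfcnt session.

In this ordering, enable clears any stale flag in the same critical section as the get. Disable checks the flag and drops the reference in the same critical section. A flag observed later can therefore only mean "a reset dropped our reference after we took it". Under that rule, enable returns a negative errno on a racing reset.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the perfcnt/reset interplay; not panfrost code. */
struct model_dev {
	atomic_flag as_lock;    /* assumed lock shared by reset and perfcnt paths */
	int as_count;           /* models mmu->as_count (perfcnt's reference only) */
	bool hw_reset_happened; /* models perfcnt->hw_reset_happened, under as_lock */
};

static void model_lock(struct model_dev *d)
{
	while (atomic_flag_test_and_set(&d->as_lock))
		; /* spin until the lock is free */
}

static void model_unlock(struct model_dev *d)
{
	atomic_flag_clear(&d->as_lock);
}

static void model_init(struct model_dev *d)
{
	atomic_flag_clear(&d->as_lock);
	d->as_count = 0;
	d->hw_reset_happened = false;
}

/* Models the reset: every AS reference is dropped and the reset is recorded. */
static void model_reset(struct model_dev *d)
{
	model_lock(d);
	d->as_count = 0;
	d->hw_reset_happened = true;
	model_unlock(d);
}

/* Check-and-put in one critical section, so no reset can slip in between. */
static void model_put_unless_reset(struct model_dev *d)
{
	model_lock(d);
	if (!d->hw_reset_happened)
		d->as_count--;
	assert(d->as_count >= 0); /* stands in for the underflow WARN */
	model_unlock(d);
}

/*
 * during_setup, if non-NULL, stands in for whatever may run while the
 * counters are being configured (for example, a racing reset).
 */
static int model_enable(struct model_dev *d,
			void (*during_setup)(struct model_dev *))
{
	bool reset;

	/* Clearing a stale flag together with the get means any flag seen
	 * later comes from a reset that happened after this point. */
	model_lock(d);
	d->hw_reset_happened = false;
	d->as_count++;
	model_unlock(d);

	if (during_setup)
		during_setup(d);

	model_lock(d);
	reset = d->hw_reset_happened;
	model_unlock(d);
	if (reset)
		return -EAGAIN; /* the reset already dropped our reference; negative errno */
	return 0;
}

static void model_disable(struct model_dev *d)
{
	model_put_unless_reset(d);
}
```

Whether the real reset path can take such a lock without running into the deadlock and fence-signalling concerns described in the patch comment is exactly the open question. The model only shows the ordering a fix would need to preserve.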
