On 5/29/26 04:27, Honglei Huang wrote:
> Since commit 144ba981783f ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
> moved mmu_interval_read_begin() out of the per-chunk loop, the
> captured notifier_seq is no longer refreshed across retries. As a
> result, the existing -EBUSY retry path can never make progress:
>
> hmm_range_fault() returns -EBUSY only when
> mmu_interval_check_retry(notifier, notifier_seq) reports that the
> sequence is stale. Once the sequence has advanced, the stored seq
> will never match again, so every subsequent call within the same
> invocation returns -EBUSY immediately.
>
> The "goto retry" therefore degenerates into a busy spin that simply
> burns CPU for the full HMM_RANGE_DEFAULT_TIMEOUT (~1s) window before
> finally bailing out with -EAGAIN. This is pure latency with no chance
> of recovery, and it actively hurts the KFD userptr stack: the caller
> ends up blocked for a second while holding mmap_lock, only to return
> -EAGAIN to the restore worker (or to userspace) which would have
> re-driven the operation immediately anyway.
>
> Drop the retry/timeout entirely and let -EBUSY propagate straight to
> out_free_pfns, where it is already translated to -EAGAIN. Recovery is
> handled at a higher level: the KFD restore_userptr_worker reschedules
> itself, and the userptr ioctl path returns -EAGAIN to userspace.
>
> No functional regression: the previous behaviour on -EBUSY was already
> to fail with -EAGAIN after a 1s stall; we just skip the stall.
>
> Signed-off-by: Honglei Huang <[email protected]>
Reviewed-by: Christian König <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> index 5d72878c8..229c30867 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> @@ -172,7 +172,6 @@ int amdgpu_hmm_range_get_pages(struct
> mmu_interval_notifier *notifier,
> const u64 max_bytes = SZ_2G;
>
> struct hmm_range *hmm_range = &range->hmm_range;
> - unsigned long timeout;
> unsigned long *pfns;
> unsigned long end;
> int r;
> @@ -199,15 +198,9 @@ int amdgpu_hmm_range_get_pages(struct
> mmu_interval_notifier *notifier,
> pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
> hmm_range->start, hmm_range->end);
>
> - timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> -
> -retry:
> r = hmm_range_fault(hmm_range);
> - if (unlikely(r)) {
> - if (r == -EBUSY && !time_after(jiffies, timeout))
> - goto retry;
> + if (unlikely(r))
> goto out_free_pfns;
> - }
>
> if (hmm_range->end == end)
> break;