On 5/29/26 04:27, Honglei Huang wrote:
> Since commit 144ba981783f ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
> moved mmu_interval_read_begin() out of the per-chunk loop, the
> captured notifier_seq is no longer refreshed across retries. As a
> result, the existing -EBUSY retry path can never make progress:
> 
>   hmm_range_fault() returns -EBUSY only when
>   mmu_interval_check_retry(notifier, notifier_seq) reports that the
>   sequence is stale. Once the sequence has advanced, the stored seq
>   will never match again, so every subsequent call within the same
>   invocation returns -EBUSY immediately.
> 
> The "goto retry" therefore degenerates into a busy spin that simply
> burns CPU for the full HMM_RANGE_DEFAULT_TIMEOUT (~1s) window before
> finally bailing out with -EAGAIN. This is pure latency with no chance
> of recovery, and it actively hurts the KFD userptr stack: the caller
> ends up blocked for a second while holding mmap_lock, only to return
> -EAGAIN to the restore worker (or to userspace) which would have
> re-driven the operation immediately anyway.
> 
> Drop the retry/timeout entirely and let -EBUSY propagate straight to
> out_free_pfns, where it is already translated to -EAGAIN. Recovery is
> handled at a higher level: the KFD restore_userptr_worker reschedules
> itself, and the userptr ioctl path returns -EAGAIN to userspace.
> 
> No functional regression: the previous behaviour on -EBUSY was already
> to fail with -EAGAIN after a 1s stall; we just skip the stall.
> 
> Signed-off-by: Honglei Huang <[email protected]>

Reviewed-by: Christian König <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> index 5d72878c8..229c30867 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> @@ -172,7 +172,6 @@ int amdgpu_hmm_range_get_pages(struct 
> mmu_interval_notifier *notifier,
>       const u64 max_bytes = SZ_2G;
>  
>       struct hmm_range *hmm_range = &range->hmm_range;
> -     unsigned long timeout;
>       unsigned long *pfns;
>       unsigned long end;
>       int r;
> @@ -199,15 +198,9 @@ int amdgpu_hmm_range_get_pages(struct 
> mmu_interval_notifier *notifier,
>               pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
>                       hmm_range->start, hmm_range->end);
>  
> -             timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> -
> -retry:
>               r = hmm_range_fault(hmm_range);
> -             if (unlikely(r)) {
> -                     if (r == -EBUSY && !time_after(jiffies, timeout))
> -                             goto retry;
> +             if (unlikely(r))
>                       goto out_free_pfns;
> -             }
>  
>               if (hmm_range->end == end)
>                       break;

Reply via email to