Re: [V2] drm/amdgpu: drop retry loop in amdgpu_hmm_range_get_pages

Huang, Honglei1 Sun, 31 May 2026 19:40:53 -0700



On 5/30/2026 2:35 AM, Chen, Xiaogang wrote:


On 5/29/2026 2:35 AM, Honglei Huang wrote:

Since commit 144ba981783f ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
moved mmu_interval_read_begin() out of the per-chunk loop, the
captured notifier_seq is no longer refreshed across retries. As a
result, the existing -EBUSY retry path can never make progress:

"retry" should come with mmu_interval_read_begin. The commit144ba981783f move mmu_interval_read_begin out of loop, then "retry"should also be moved out loop with mmu_interval_read_begin.

   hmm_range_fault() returns -EBUSY only when
   mmu_interval_check_retry(notifier, notifier_seq) reports that the
   sequence is stale. Once the sequence has advanced, the stored seq
   will never match again, so every subsequent call within the same
   invocation returns -EBUSY immediately.

The "goto retry" therefore degenerates into a busy spin that simply
burns CPU for the full HMM_RANGE_DEFAULT_TIMEOUT (~1s) window before
finally bailing out with -EAGAIN. This is pure latency with no chance
of recovery, and it actively hurts the KFD userptr stack: the caller
ends up blocked for a second while holding mmap_lock, only to return
-EAGAIN to the restore worker (or to userspace) which would have
re-driven the operation immediately anyway.

Drop the retry/timeout entirely and let -EBUSY propagate straight to
out_free_pfns, where it is already translated to -EAGAIN. Recovery is
handled at a higher level: the KFD restore_userptr_worker reschedules
itself, and the userptr ioctl path returns -EAGAIN to userspace.

No functional regression: the previous behaviour on -EBUSY was already
to fail with -EAGAIN after a 1s stall; we just skip the stall.

Reviewed-by: Christian König<[email protected]>
Signed-off-by: Honglei Huang<[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 9 +--------
  1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
index 5d72878c8..229c30867 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
@@ -172,7 +172,6 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier 
*notifier,
        const u64 max_bytes = SZ_2G;

struct hmm_range *hmm_range = &range->hmm_range;

-       unsigned long timeout;
        unsigned long *pfns;
        unsigned long end;
        int r;
@@ -199,15 +198,9 @@ int amdgpu_hmm_range_get_pages(struct 
mmu_interval_notifier *notifier,
                pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
                        hmm_range->start, hmm_range->end);

- timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);

-
-retry:

If you remove "retry" here the callers including user space need to redothe thing. This function's work is memory validation. It should do thatin its best before return to caller.

If user space is the caller there would be a lot of user space from/tokernel space context switches. That will make the procedure even slower.

I think we need keep "retry" inside this function, but move "retry" outof loop at mmu_interval_read_begin.

This option has been discussed inprevious:https://lore.kernel.org/amd-gfx/[email protected]/#:~:text=%3E%3E%20What%20probably%20needs%20to%20happen%20is%20that%20we%20need%20to%20move%20the%20retry%20label%20or%20just%20completely%20stop%20retrying%20at%20all.

Why drop the retry here is because to keep the behavior unchanged, thecurrent code logic will always return -EBUSY when goto retry path (pageschange when get pages).

And there is a scenario of extreme conditions where a huge system bufferattempts to use this function, causing the function to retryindefinitely, for example, with a space of 60G, it will always getsretry again and again.

Maybe it is about the overall architecture, the set of hmm function inamdgpu needs to be size limited or it needs to be cut into smallerpieces if the interval specified by the user is too large.


Regards,
Honglei


Regards
Xiaogang

                r = hmm_range_fault(hmm_range);
-               if (unlikely(r)) {
-                       if (r == -EBUSY && !time_after(jiffies, timeout))
-                               goto retry;
+               if (unlikely(r))
                        goto out_free_pfns;
-               }

if (hmm_range->end == end)

                        break;

Re: [V2] drm/amdgpu: drop retry loop in amdgpu_hmm_range_get_pages

Reply via email to