On 5/30/2026 2:35 AM, Chen, Xiaogang wrote:

On 5/29/2026 2:35 AM, Honglei Huang wrote:
Since commit 144ba981783f ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
moved mmu_interval_read_begin() out of the per-chunk loop, the
captured notifier_seq is no longer refreshed across retries. As a
result, the existing -EBUSY retry path can never make progress:
"retry" should come with mmu_interval_read_begin. The commit 144ba981783f move mmu_interval_read_begin out of loop, then "retry" should also be moved out loop with mmu_interval_read_begin.
   hmm_range_fault() returns -EBUSY only when
   mmu_interval_check_retry(notifier, notifier_seq) reports that the
   sequence is stale. Once the sequence has advanced, the stored seq
   will never match again, so every subsequent call within the same
   invocation returns -EBUSY immediately.

The "goto retry" therefore degenerates into a busy spin that simply
burns CPU for the full HMM_RANGE_DEFAULT_TIMEOUT (~1s) window before
finally bailing out with -EAGAIN. This is pure latency with no chance
of recovery, and it actively hurts the KFD userptr stack: the caller
ends up blocked for a second while holding mmap_lock, only to return
-EAGAIN to the restore worker (or to userspace) which would have
re-driven the operation immediately anyway.

Drop the retry/timeout entirely and let -EBUSY propagate straight to
out_free_pfns, where it is already translated to -EAGAIN. Recovery is
handled at a higher level: the KFD restore_userptr_worker reschedules
itself, and the userptr ioctl path returns -EAGAIN to userspace.

No functional regression: the previous behaviour on -EBUSY was already
to fail with -EAGAIN after a 1s stall; we just skip the stall.

Reviewed-by: Christian König<[email protected]>
Signed-off-by: Honglei Huang<[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 9 +--------
  1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
index 5d72878c8..229c30867 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
@@ -172,7 +172,6 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier 
*notifier,
        const u64 max_bytes = SZ_2G;
struct hmm_range *hmm_range = &range->hmm_range;
-       unsigned long timeout;
        unsigned long *pfns;
        unsigned long end;
        int r;
@@ -199,15 +198,9 @@ int amdgpu_hmm_range_get_pages(struct 
mmu_interval_notifier *notifier,
                pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
                        hmm_range->start, hmm_range->end);
- timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
-
-retry:
If you remove "retry" here the callers including user space need to redo the thing. This function's work is memory validation. It should do that in its best before return to caller.

If user space is the caller there would be a lot of user space from/to kernel space context switches. That will make the procedure even slower.

I think we need keep "retry" inside this function, but move "retry" out of loop at mmu_interval_read_begin.

This option has been discussed in previous:https://lore.kernel.org/amd-gfx/[email protected]/#:~:text=%3E%3E%20What%20probably%20needs%20to%20happen%20is%20that%20we%20need%20to%20move%20the%20retry%20label%20or%20just%20completely%20stop%20retrying%20at%20all.

Why drop the retry here is because to keep the behavior unchanged, the current code logic will always return -EBUSY when goto retry path (pages change when get pages).

And there is a scenario of extreme conditions where a huge system buffer attempts to use this function, causing the function to retry indefinitely, for example, with a space of 60G, it will always gets retry again and again.

Maybe it is about the overall architecture, the set of hmm function in amdgpu needs to be size limited or it needs to be cut into smaller pieces if the interval specified by the user is too large.

Regards,
Honglei



Regards
Xiaogang

                r = hmm_range_fault(hmm_range);
-               if (unlikely(r)) {
-                       if (r == -EBUSY && !time_after(jiffies, timeout))
-                               goto retry;
+               if (unlikely(r))
                        goto out_free_pfns;
-               }
if (hmm_range->end == end)
                        break;

Reply via email to