On 5/29/2026 3:04 PM, Christian König wrote:
On 5/29/26 04:27, Honglei Huang wrote:

Since commit 144ba981783f ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
moved mmu_interval_read_begin() out of the per-chunk loop, the captured
notifier_seq is no longer refreshed across retries.

As a result, the existing -EBUSY retry path can never make progress:
hmm_range_fault() returns -EBUSY only when
mmu_interval_check_retry(notifier, notifier_seq) reports that the
sequence is stale. Once the sequence has advanced, the stored seq will
never match again, so every subsequent call within the same invocation
returns -EBUSY immediately. The "goto retry" therefore degenerates into
a busy spin that simply burns CPU for the full HMM_RANGE_DEFAULT_TIMEOUT
(~1s) window before finally bailing out with -EAGAIN.

This is pure latency with no chance of recovery, and it actively hurts
the KFD userptr stack: the caller ends up blocked for a second while
holding mmap_lock, only to return -EAGAIN to the restore worker (or to
userspace) which would have re-driven the operation immediately anyway.

Drop the retry/timeout entirely and let -EBUSY propagate straight to
out_free_pfns, where it is already translated to -EAGAIN. Recovery is
handled at a higher level: the KFD restore_userptr_worker reschedules
itself, and the userptr ioctl path returns -EAGAIN to userspace.

No functional regression: the previous behaviour on -EBUSY was already
to fail with -EAGAIN after a 1s stall; we just skip the stall.

Signed-off-by: Honglei Huang <[email protected]>

Reviewed-by: Christian König <[email protected]>

Thanks a lot for the review, will respin with your R-b added.

Regards,
Honglei

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
index 5d72878c8..229c30867 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
@@ -172,7 +172,6 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
 	const u64 max_bytes = SZ_2G;
 	struct hmm_range *hmm_range = &range->hmm_range;
-	unsigned long timeout;
 	unsigned long *pfns;
 	unsigned long end;
 	int r;
@@ -199,15 +198,9 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
 		pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
 			hmm_range->start, hmm_range->end);
 
-		timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
-
-retry:
 		r = hmm_range_fault(hmm_range);
-		if (unlikely(r)) {
-			if (r == -EBUSY && !time_after(jiffies, timeout))
-				goto retry;
+		if (unlikely(r))
 			goto out_free_pfns;
-		}
 
 		if (hmm_range->end == end)
 			break;
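
For anyone following the thread, the failure mode described in the commit message can be sketched in plain userspace C. This is only an illustrative model, not kernel code: struct fake_notifier and the fake_*() helpers are hypothetical stand-ins that mimic the sequence comparison behind mmu_interval_read_begin(), mmu_interval_check_retry() and the -EBUSY path of hmm_range_fault(), and the timeout is replaced by a simple attempt cap.

```c
#include <errno.h>

/* Hypothetical stand-in for an mmu_interval_notifier: only the counter. */
struct fake_notifier {
	unsigned long invalidate_seq;
};

/* Models mmu_interval_read_begin(): snapshot the current sequence. */
static unsigned long fake_read_begin(const struct fake_notifier *n)
{
	return n->invalidate_seq;
}

/* Models mmu_interval_check_retry(): nonzero once the snapshot is stale. */
static int fake_check_retry(const struct fake_notifier *n, unsigned long seq)
{
	return n->invalidate_seq != seq;
}

/* Models only the -EBUSY path of hmm_range_fault(). */
static int fake_range_fault(const struct fake_notifier *n, unsigned long seq)
{
	return fake_check_retry(n, seq) ? -EBUSY : 0;
}

/*
 * Models the old "goto retry" loop: seq is captured once by the caller
 * and never refreshed, and max_attempts stands in for the timeout.
 * Returns the last result and stores the number of calls in *attempts.
 */
static int retry_with_stale_seq(const struct fake_notifier *n,
				unsigned long seq, int max_attempts,
				int *attempts)
{
	int r = -EBUSY;

	*attempts = 0;
	while (*attempts < max_attempts) {
		(*attempts)++;
		r = fake_range_fault(n, seq);
		if (r != -EBUSY)
			break;
	}
	return r;
}
```

If the counter is bumped once after fake_read_begin() has been called, retry_with_stale_seq() uses up every allowed attempt and still returns -EBUSY, which is the busy spin the patch removes. With no intervening invalidation, the first attempt succeeds and the loop never runs a second time.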
