AMD General

-----Original Message-----
From: Koenig, Christian <[email protected]>
Sent: Thursday, May 28, 2026 2:54 PM
To: Huang, Honglei1 <[email protected]>; [email protected]
Cc: Deucher, Alexander <[email protected]>; Huang, Ray 
<[email protected]>; Prosyak, Vitaly <[email protected]>; Liu, Jenny 
(Jing) <[email protected]>
Subject: Re: [PATCH] drm/amdgpu: move notifier_seq read back inside retry loop

On 5/28/26 08:29, Honglei Huang wrote:
> Align with drm_gpusvm_get_pages() (drm_gpusvm.c line 1416, 1440) which
> refreshes notifier_seq via mmu_interval_read_begin() on each retry
> iteration. Without this, a stale sequence number causes
> hmm_range_fault() to perpetually return -EBUSY, leading to an infinite
> retry loop at the caller level.

Absolutely clear NAK.

This is exactly the bug 144ba981783f ("drm/amdgpu: fix 
amdgpu_hmm_range_get_pages") tries to fix.

The sequence *must* be grabbed before the loop because it protects all pages 
and not just the one from the current grabbed chunk.

What probably needs to happen is that we need to move the retry label or just 
completely stop retrying at all.

Got it.
But in my local test, the test KFDMemoryTest.LargestSysBufferTest always fail.
It can pass before. This case needs RAM size big enough to reproduce,
for some large RAM size CI platform, it can reproduce easily.
Maybe someone can else can double check.

And I understand your concern about the seq scope.

Would it be acceptable to just remove the internal retry entirely and
propagate -EBUSY to the caller? The caller already handles retry at a
higher level. Something like:

hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
r = hmm_range_fault(hmm_range);
if (unlikely(r))
    goto out_free_pfns;

This keeps your seq placement while eliminating the infinite loop on
-EBUSY.

Regards,
Christian.

>
> Fixes: 144ba981783f ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
> Signed-off-by: Honglei Huang <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> index 5d72878c8..ec0fe9044 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> @@ -192,7 +192,6 @@ int amdgpu_hmm_range_get_pages(struct 
> mmu_interval_notifier *notifier,
>       end = start + npages * PAGE_SIZE;
>       hmm_range->dev_private_owner = owner;
>
> -     hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
>       do {
>               hmm_range->end = min(hmm_range->start + max_bytes, end);
>
> @@ -202,6 +201,7 @@ int amdgpu_hmm_range_get_pages(struct 
> mmu_interval_notifier *notifier,
>               timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>
>  retry:
> +             hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
>               r = hmm_range_fault(hmm_range);
>               if (unlikely(r)) {
>                       if (r == -EBUSY && !time_after(jiffies, timeout))

Reply via email to