On 5/28/26 09:22, Huang, Honglei1 wrote:
> AMD General
> 
> -----Original Message-----
> From: Koenig, Christian <[email protected]>
> Sent: Thursday, May 28, 2026 2:54 PM
> To: Huang, Honglei1 <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>; Huang, Ray 
> <[email protected]>; Prosyak, Vitaly <[email protected]>; Liu, Jenny 
> (Jing) <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: move notifier_seq read back inside retry loop
> 
> On 5/28/26 08:29, Honglei Huang wrote:
>> Align with drm_gpusvm_get_pages() (drm_gpusvm.c line 1416, 1440) which
>> refreshes notifier_seq via mmu_interval_read_begin() on each retry
>> iteration. Without this, a stale sequence number causes
>> hmm_range_fault() to perpetually return -EBUSY, leading to an infinite
>> retry loop at the caller level.
> 
> Absolutely clear NAK.
> 
> This is exactly the bug 144ba981783f ("drm/amdgpu: fix 
> amdgpu_hmm_range_get_pages") tries to fix.
> 
> The sequence *must* be grabbed before the loop because it protects all pages 
> and not just the one from the current grabbed chunk.
> 
> What probably needs to happen is that we need to move the retry label or just 
> completely stop retrying at all.
> 
> Got it.
> But in my local test, the test KFDMemoryTest.LargestSysBufferTest always fail.

I mean that's pretty much intentional. Testing is for falsification and not 
validation.

Either the use case is not valid in the first place or the test is not valid.

> It can pass before. This case needs RAM size big enough to reproduce,
> for some large RAM size CI platform, it can reproduce easily.
> Maybe someone can else can double check.
> 
> And I understand your concern about the seq scope.
> 
> Would it be acceptable to just remove the internal retry entirely and
> propagate -EBUSY to the caller?

Yeah I had a similar thought. I don't think the code should retry at all.

-EBUSY means userspace did something in parallel which resulting in the 
operation to not be able to complete.

So the only good reaction I can see is to abort and let user space or higher 
level retry.

Regards,
Christian.

> The caller already handles retry at a
> higher level. Something like:
> 
> hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
> r = hmm_range_fault(hmm_range);
> if (unlikely(r))
>     goto out_free_pfns;
> 
> This keeps your seq placement while eliminating the infinite loop on
> -EBUSY.
> 
> Regards,
> Christian.
> 
>>
>> Fixes: 144ba981783f ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
>> Signed-off-by: Honglei Huang <[email protected]>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> index 5d72878c8..ec0fe9044 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> @@ -192,7 +192,6 @@ int amdgpu_hmm_range_get_pages(struct 
>> mmu_interval_notifier *notifier,
>>       end = start + npages * PAGE_SIZE;
>>       hmm_range->dev_private_owner = owner;
>>
>> -     hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
>>       do {
>>               hmm_range->end = min(hmm_range->start + max_bytes, end);
>>
>> @@ -202,6 +201,7 @@ int amdgpu_hmm_range_get_pages(struct 
>> mmu_interval_notifier *notifier,
>>               timeout = jiffies + 
>> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>>
>>  retry:
>> +             hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
>>               r = hmm_range_fault(hmm_range);
>>               if (unlikely(r)) {
>>                       if (r == -EBUSY && !time_after(jiffies, timeout))
> 

Reply via email to