On 2/9/26 15:16, Honglei Huang wrote:
> The case you described, where one hmm_range_fault() invalidates another's
> seq under the same notifier, is already handled in the implementation.
>
> example: suppose ranges A, B, C share one notifier:
>
> 1. hmm_range_fault(A) succeeds, seq_A recorded
> 2. External invalidation occurs, triggers callback:
> mutex_lock(notifier_lock)
> → mmu_interval_set_seq()
> → range_A->valid = false
> → mem->invalid++
> mutex_unlock(notifier_lock)
> 3. hmm_range_fault(B) succeeds
> 4. Commit phase:
> mutex_lock(notifier_lock)
> → check mem->invalid != saved_invalid
> → return -EAGAIN, retry the entire batch
> mutex_unlock(notifier_lock)
>
> All concurrent invalidations are caught by the mem->invalid counter.
> Additionally, amdgpu_ttm_tt_get_user_pages_done() in
> confirm_valid_user_pages_locked
> performs a per-range mmu_interval_read_retry() as a final safety check.
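>
> As a minimal sketch, the commit-phase check looks roughly like this (the
> field names just follow the description above, the real code differs in
> detail):
>
> /* saved_invalid was sampled under the lock before faulting */
> mutex_lock(&process_info->notifier_lock);
> if (mem->invalid != saved_invalid) {
>         /* an invalidation raced with one of the per-range faults,
>          * so nothing is committed and the whole batch is retried */
>         mutex_unlock(&process_info->notifier_lock);
>         return -EAGAIN;
> }
> /* no invalidation since the snapshot: commit, then drop the lock */
> commit_batch(mem);              /* placeholder for the actual commit */
> mutex_unlock(&process_info->notifier_lock);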
>
> DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
> hmm_range_fault() per-range independently; there is no array version
> of hmm_range_fault() in DRM GPU SVM either. If you consider this approach
> unworkable, then DRM GPU SVM would be unworkable too, yet it has been
> accepted upstream.
>
> The number of batch ranges is controllable, and even if it scales into
> the thousands, DRM GPU SVM faces exactly the same situation without an
> array version of hmm_range_fault(), which suggests this is a correctness
> question rather than a performance one. On correctness, I believe DRM
> GPU SVM already demonstrates that the approach works.
Well yes, GPU SVM would have exactly the same problems. But it also doesn't
have a bulk userptr creation interface.
The implementation is simply not made for this use case, and as far as I know
no current upstream implementation is.
> For performance, I have tested with thousands of ranges present:
> performance reaches 80%~95% of the native driver, and all OpenCL
> and ROCr test suites pass with no correctness issues.
Testing can only falsify a system and not verify it.
> Here is how DRM GPU SVM handles correctness with multiple ranges
> under one wide notifier while doing per-range hmm_range_fault():
>
> Invalidation: drm_gpusvm_notifier_invalidate()
> - Acquires notifier_lock
> - Calls mmu_interval_set_seq()
> - Iterates affected ranges via driver callback (xe_svm_invalidate)
> - Clears has_dma_mapping = false for each affected range (under lock)
> - Releases notifier_lock
>
> Fault: drm_gpusvm_get_pages() (called per-range independently)
> - mmu_interval_read_begin() to get seq
> - hmm_range_fault() outside lock
> - Acquires notifier_lock
> - mmu_interval_read_retry() → if stale, release lock and retry
> - DMA map pages + set has_dma_mapping = true (under lock)
> - Releases notifier_lock
>
> Validation: drm_gpusvm_pages_valid()
> - Checks has_dma_mapping flag (under lock), NOT seq
>
> If invalidation occurs between two per-range faults, the flag is
> cleared under lock, and either mmu_interval_read_retry catches it
> in the current fault, or drm_gpusvm_pages_valid() catches it at
> validation time. No stale pages are ever committed.
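>
> Condensed into a sketch (simplified; the lock flavors and field names
> follow the description above rather than the exact drm_gpusvm code):
>
> /* invalidation side: under the notifier lock, held for write */
> mmu_interval_set_seq(mni, cur_seq);
> range->flags.has_dma_mapping = false;
>
> /* fault side: hmm_range_fault() was already done outside the lock */
> down_read(&gpusvm->notifier_lock);
> if (mmu_interval_read_retry(notifier, seq)) {
>         up_read(&gpusvm->notifier_lock);
>         goto retry;                     /* stale, fault that range again */
> }
> /* DMA map the pages, then publish validity, still under the lock */
> range->flags.has_dma_mapping = true;
> up_read(&gpusvm->notifier_lock);
>
> /* validation side: a flag check under the lock instead of a seq check */
> valid = range->flags.has_dma_mapping;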
>
> KFD batch userptr uses the same three-step pattern:
>
> Invalidation: amdgpu_amdkfd_evict_userptr_batch()
> - Acquires notifier_lock
> - Calls mmu_interval_set_seq()
> - Iterates affected ranges via interval_tree
> - Sets range->valid = false for each affected range (under lock)
> - Increments mem->invalid (under lock)
> - Releases notifier_lock
>
> Fault: update_invalid_user_pages()
> - Per-range hmm_range_fault() outside lock
And here the idea falls apart. Each hmm_range_fault() can invalidate the other
ranges while faulting them in.
That is not fundamentally solvable, but moving the handling further into
hmm_range_fault() makes it much less likely that something goes wrong.
So once more: as long as this still uses this hacky approach, I will clearly
reject this implementation.
Regards,
Christian.
> - Acquires notifier_lock
> - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
> - Sets range->valid = true for faulted ranges (under lock)
> - Releases notifier_lock
>
> Validation: valid_user_pages_batch()
> - Checks range->valid flag
> - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
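>
> In pseudo-code the final per-range check is roughly (the list and field
> names here are only illustrative):
>
> /* under process_info->notifier_lock */
> list_for_each_entry(range, &mem->range_list, node) {
>         if (!range->valid)
>                 return false;   /* invalidated since the fault */
>         /* amdgpu_ttm_tt_get_user_pages_done() boils down to a
>          * mmu_interval_read_retry() check on the wide notifier */
>         if (mmu_interval_read_retry(&mem->notifier, range->notifier_seq))
>                 return false;   /* seq moved on, retry the batch */
> }
> return true;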
>
> The logic is equivalent as far as I can see.
>
> Regards,
> Honglei
>
>
>
> On 2026/2/9 21:27, Christian König wrote:
>> On 2/9/26 14:11, Honglei Huang wrote:
>>>
>>> So DRM GPU SVM is also a NAK?
>>>
>>> This code has passed local testing with OpenCL and ROCr, and I also
>>> provided a detailed code path and analysis.
>>> You only stated a conclusion without providing any reasons or evidence.
>>> So far your statement gives no justification and is not convincing.
>>
>> That sounds like you don't understand what the issue here is, so I will try
>> to explain it once more with pseudo-code.
>>
>> Page tables are updated without holding a lock, so when you want to grab
>> physical addresses from them you need to use an opportunistic, retry-based
>> approach to make sure that the data you got is still valid.
>>
>> In other words, something like this is needed here:
>>
>> retry:
>>         hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>         hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
>>         ...
>>         while (true) {
>>                 mmap_read_lock(mm);
>>                 err = hmm_range_fault(&hmm_range);
>>                 mmap_read_unlock(mm);
>>
>>                 if (err == -EBUSY) {
>>                         if (time_after(jiffies, timeout))
>>                                 break;
>>
>>                         /* the sequence went stale, sample it again */
>>                         hmm_range.notifier_seq =
>>                                 mmu_interval_read_begin(notifier);
>>                         continue;
>>                 }
>>                 break;
>>         }
>>         ...
>>         for (i = 0, j = 0; i < npages; ++j) {
>>                 ...
>>                 dma_map_page(...);
>>                 ...
>>         }
>>         ...
>>         /* did the page tables change while we were faulting/mapping? */
>>         grab_notifier_lock();
>>         if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
>>                 goto retry;
>>         restart_queues();
>>         drop_notifier_lock();
>>         ...
>>
>> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid
>> or not after you grabbed the notifier lock.
>>
>> The problem is that hmm_range works only on a single range/sequence
>> combination, so when you do multiple calls to hmm_range_fault() for
>> scattered VA it can easily be that one call invalidates the ranges of
>> another call.
>>
>> So as long as you only have a few hundred hmm_ranges for your userptrs,
>> that kind of works, but it doesn't scale up to the thousands of different
>> VA addresses you get with scattered handling.
>>
>> That's why hmm_range_fault() needs to be modified to handle an array of VA
>> addresses instead of just an A..B range.
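>>
>> Something along these lines, purely illustrative since no such interface
>> exists today:
>>
>> struct hmm_range_batch {
>>         struct mmu_interval_notifier *notifier;
>>         unsigned long notifier_seq;       /* one seq for the whole batch */
>>         unsigned int nr_ranges;
>>         struct {
>>                 unsigned long start, end; /* scattered VA */
>>                 unsigned long *hmm_pfns;
>>         } ranges[];
>> };
>>
>> int hmm_range_fault_batch(struct hmm_range_batch *batch);
>>
>> With a single notifier_seq covering all entries, the final check stays one
>> mmu_interval_read_retry() no matter how scattered the VA is.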
>>
>> Regards,
>> Christian.
>>
>>
>>>
>>> On 2026/2/9 20:59, Christian König wrote:
>>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>>
>>>> I'm not sure what you are talking about; drm_gpusvm_get_pages() only
>>>> supports a single range as well, not scatter/gather of VA addresses.
>>>>
>>>> As far as I can see that doesn't help in the slightest.
>>>>
>>>>> My implementation follows the same pattern. The detailed comparison
>>>>> of the invalidation path was provided in the second half of my previous mail.
>>>>
>>>> Yeah, and as I said, that is not very valuable because it doesn't solve the
>>>> sequence problem.
>>>>
>>>> As far as I can see the approach you are trying here is a clear NAK from my side.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>>
>>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>>
>>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>>> multiple user virtual address ranges under a single
>>>>>>> mmu_interval_notifier,
>>>>>> and these ranges can be non-contiguous, which is essentially the same
>>>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>>>
>>>>>> That still doesn't solve the sequencing problem.
>>>>>>
>>>>>> As far as I can see you either can't use hmm_range_fault() with this
>>>>>> approach, or it would just not be very valuable.
>>>>>>
>>>>>> So how should that work with your patch set?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc():
>>>>>>>   notifier->itree.start = ALIGN_DOWN(fault_addr,
>>>>>>>                                      gpusvm->notifier_size);
>>>>>>>   notifier->itree.last = ALIGN(fault_addr + 1,
>>>>>>>                                gpusvm->notifier_size) - 1;
>>>>>>> The Xe driver passes xe_modparam.svm_notifier_size * SZ_1M as the
>>>>>>> notifier_size in xe_svm_init(), so one notifier can cover many MB of
>>>>>>> VA space containing multiple non-contiguous ranges.
>>>>>>>
>>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>>> validation instead of seq-based validation in:
>>>>>>> - drm_gpusvm_pages_valid() checks flags.has_dma_mapping, not
>>>>>>>   notifier_seq. The comment explicitly states:
>>>>>>>   "This is akin to a notifier seqno check in the HMM documentation
>>>>>>>   but due to wider notifiers (i.e., notifiers which span multiple
>>>>>>>   ranges) this function is required for finer grained checking"
>>>>>>> - __drm_gpusvm_unmap_pages() clears
>>>>>>> flags.has_dma_mapping = false under notifier_lock
>>>>>>> - drm_gpusvm_get_pages() sets
>>>>>>> flags.has_dma_mapping = true under notifier_lock
>>>>>>> I adopted the same approach.
>>>>>>>
>>>>>>> DRM GPU SVM:
>>>>>>>   drm_gpusvm_notifier_invalidate()
>>>>>>>     down_write(&gpusvm->notifier_lock);
>>>>>>>     mmu_interval_set_seq(mni, cur_seq);
>>>>>>>     gpusvm->ops->invalidate()
>>>>>>>       -> xe_svm_invalidate()
>>>>>>>         drm_gpusvm_for_each_range()
>>>>>>>           -> __drm_gpusvm_unmap_pages()
>>>>>>>             WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>>>>>     up_write(&gpusvm->notifier_lock);
>>>>>>>
>>>>>>> KFD batch userptr:
>>>>>>>   amdgpu_amdkfd_evict_userptr_batch()
>>>>>>>     mutex_lock(&process_info->notifier_lock);
>>>>>>>     mmu_interval_set_seq(mni, cur_seq);
>>>>>>>     discard_invalid_ranges()
>>>>>>>       interval_tree_iter_first/next()
>>>>>>>         range_info->valid = false; // clear flag
>>>>>>>     mutex_unlock(&process_info->notifier_lock);
>>>>>>>
>>>>>>> Both implementations:
>>>>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>>>>> - Call mmu_interval_set_seq() under the lock
>>>>>>> - Use interval tree to find affected ranges within the wide
>>>>>>>   notifier
>>>>>>> - Mark per-range flag as invalid/valid under the lock
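>>>>>>>
>>>>>>> As a sketch, the shared shape of the invalidation callback is roughly
>>>>>>> the following (identifiers are illustrative, not the exact code of
>>>>>>> either driver; mn_range is the mmu_notifier_range passed to the
>>>>>>> callback):
>>>>>>>
>>>>>>> mutex_lock(&notifier_lock);
>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>> for (node = interval_tree_iter_first(&mem->range_tree,
>>>>>>>                                      mn_range->start,
>>>>>>>                                      mn_range->end - 1);
>>>>>>>      node;
>>>>>>>      node = interval_tree_iter_next(node, mn_range->start,
>>>>>>>                                     mn_range->end - 1)) {
>>>>>>>         rinfo = container_of(node, struct user_range_info, it_node);
>>>>>>>         rinfo->valid = false;     /* only the overlapped ranges */
>>>>>>> }
>>>>>>> mem->invalid++;
>>>>>>> mutex_unlock(&notifier_lock);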
>>>>>>>
>>>>>>> The page fault path and final validation path also follow the same
>>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>>> flag under the lock.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Honglei
>>>>>>>
>>>>>>>
>>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>>> From: Honglei Huang <[email protected]>
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> This is v3 of the patch series to support allocating multiple
>>>>>>>>> non-contiguous
>>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU
>>>>>>>>> virtual address.
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>>
>>>>>>>> That is most likely not the best approach, but Felix or Philip need to
>>>>>>>> comment here since I don't know such IOCTLs well either.
>>>>>>>>
>>>>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>>>>> - Minimal API surface change
>>>>>>>>
>>>>>>>> Why a range of VA space for each entry?
>>>>>>>>
>>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>>> - Single mmu_interval_notifier covering the VA span [va_min,
>>>>>>>>> va_max]
>>>>>>>>> - Interval tree for efficient lookup of affected ranges during
>>>>>>>>> invalidation
>>>>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>>
>>>>>>>> That won't work unless you also modify hmm_range_fault() to take
>>>>>>>> multiple VA addresses (or ranges) at the same time.
>>>>>>>>
>>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect
>>>>>>>> changes to the page tables in question, but that in turn works only if
>>>>>>>> you have one hmm_range structure and not multiple.
>>>>>>>>
>>>>>>>> What might work is doing an XOR or CRC over all the hmm_range.notifier_seq
>>>>>>>> values you have, but that is a bit flaky.
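>>>>>>>>
>>>>>>>> Roughly, and only to illustrate the idea (the seq-peeking helper is
>>>>>>>> hypothetical, and XOR collisions are part of why it is flaky):
>>>>>>>>
>>>>>>>> xor_start = 0;
>>>>>>>> for (i = 0; i < num_ranges; ++i) {
>>>>>>>>         ranges[i].notifier_seq = mmu_interval_read_begin(ranges[i].notifier);
>>>>>>>>         xor_start ^= ranges[i].notifier_seq;
>>>>>>>> }
>>>>>>>> /* ... fault and map all the ranges ... */
>>>>>>>> xor_now = 0;
>>>>>>>> for (i = 0; i < num_ranges; ++i)
>>>>>>>>         xor_now ^= current_notifier_seq(ranges[i].notifier); /* hypothetical */
>>>>>>>> if (xor_now != xor_start)
>>>>>>>>         goto retry;     /* some range was invalidated in between */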
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier
>>>>>>>>> review
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for
>>>>>>>>> invalidation
>>>>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>>> - Single kgd_mem object with array of user_range_info
>>>>>>>>> structures
>>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>>
>>>>>>>>> Current Implementation Approach
>>>>>>>>> ===============================
>>>>>>>>>
>>>>>>>>> This series implements a practical solution within existing kernel
>>>>>>>>> constraints:
>>>>>>>>>
>>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>>> entire range from lowest to highest address in the batch
>>>>>>>>>
>>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>>> which specific ranges are affected during invalidation
>>>>>>>>> callbacks,
>>>>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>>>>
>>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>>> restore paths, maintaining consistency with existing userptr
>>>>>>>>> handling
>>>>>>>>>
>>>>>>>>> Patch Series Overview
>>>>>>>>> =====================
>>>>>>>>>
>>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data
>>>>>>>>> structures
>>>>>>>>>
>>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>>> - user_range_info structure for per-range tracking
>>>>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>>>>
>>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>>>>> - mark_invalid_ranges() function
>>>>>>>>>
>>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>>> - Single notifier for entire VA span
>>>>>>>>> - Invalidation callback using interval tree filtering
>>>>>>>>>
>>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>>> - Per-range page array management
>>>>>>>>>
>>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>>> - init_user_pages_batch() main initialization
>>>>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>>
>>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>>>>> - Integration with existing userptr validation flows
>>>>>>>>>
>>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>>> - Input validation and range array parsing
>>>>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>>>>
>>>>>>>>> Testing
>>>>>>>>> =======
>>>>>>>>>
>>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>>
>>>>>>>>> Thank you for your review and feedback.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Honglei Huang
>>>>>>>>>
>>>>>>>>> Honglei Huang (8):
>>>>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>>
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>