On 2/9/26 13:52, Honglei Huang wrote:
> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
I'm not sure what you are talking about, drm_gpusvm_get_pages() only
supports a single range as well, not scatter/gather of VA addresses. As
far as I can see that doesn't help in the slightest.

> My implementation follows the same pattern. The detailed comparison
> of the invalidation path was provided in the second half of my
> previous mail.

Yeah, and as I said that is not very valuable because it doesn't solve
the sequencing problem.

As far as I can see the approach you try here is a clear NAK from my
side.

Regards,
Christian.

>
> On 2026/2/9 18:16, Christian König wrote:
>> On 2/9/26 07:14, Honglei Huang wrote:
>>>
>>> I've reworked the implementation in v4. The fix is actually inspired
>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>
>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to
>>> track multiple user virtual address ranges under a single
>>> mmu_interval_notifier, and these ranges can be non-contiguous, which
>>> is essentially the same problem that batch userptr needs to solve:
>>> one BO backed by multiple non-contiguous CPU VA ranges sharing one
>>> notifier.
>>
>> That still doesn't solve the sequencing problem.
>>
>> As far as I can see you can't use hmm_range_fault() with this
>> approach, or it would just not be very valuable.
>>
>> So how should that work with your patch set?
>>
>> Regards,
>> Christian.
>>
>>>
>>> The wide notifier is created in drm_gpusvm_notifier_alloc():
>>>
>>>   notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>   notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>
>>> The Xe driver passes xe_modparam.svm_notifier_size * SZ_1M as the
>>> notifier_size in xe_svm_init(), so one notifier can cover many MB of
>>> VA space containing multiple non-contiguous ranges.
>>>
>>> And DRM GPU SVM solves the per-range validity problem with
>>> flag-based validation instead of seq-based validation:
>>>
>>> - drm_gpusvm_pages_valid() checks flags.has_dma_mapping, not
>>>   notifier_seq. The comment explicitly states:
>>>   "This is akin to a notifier seqno check in the HMM documentation
>>>   but due to wider notifiers (i.e., notifiers which span multiple
>>>   ranges) this function is required for finer grained checking"
>>> - __drm_gpusvm_unmap_pages() clears flags.has_dma_mapping = false
>>>   under notifier_lock
>>> - drm_gpusvm_get_pages() sets flags.has_dma_mapping = true under
>>>   notifier_lock
>>>
>>> I adopted the same approach.
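>>>
>>> Condensed to pseudo-C (an illustration of the protocol only, not the
>>> literal drm_gpusvm.c source), the flag lifecycle is:
>>>
>>>   /* writer side: any invalidation clears the per-range flag */
>>>   down_write(&gpusvm->notifier_lock);
>>>   WRITE_ONCE(range->flags.has_dma_mapping, false);
>>>   up_write(&gpusvm->notifier_lock);
>>>
>>>   /* fault side: collect/map pages outside the lock, then publish
>>>    * them by setting the flag; the read side suffices here because
>>>    * invalidation takes the lock in write mode */
>>>   down_read(&gpusvm->notifier_lock);
>>>   WRITE_ONCE(range->flags.has_dma_mapping, true);
>>>   up_read(&gpusvm->notifier_lock);
>>>
>>>   /* consumer side: a range is usable iff its flag is still set */
>>>   lockdep_assert_held(&gpusvm->notifier_lock);
>>>   usable = READ_ONCE(range->flags.has_dma_mapping);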
>>>
>>> DRM GPU SVM:
>>>   drm_gpusvm_notifier_invalidate()
>>>     down_write(&gpusvm->notifier_lock);
>>>     mmu_interval_set_seq(mni, cur_seq);
>>>     gpusvm->ops->invalidate()
>>>       -> xe_svm_invalidate()
>>>            drm_gpusvm_for_each_range()
>>>              -> __drm_gpusvm_unmap_pages()
>>>                   WRITE_ONCE(flags.has_dma_mapping, false); // clear flag
>>>     up_write(&gpusvm->notifier_lock);
>>>
>>> KFD batch userptr:
>>>   amdgpu_amdkfd_evict_userptr_batch()
>>>     mutex_lock(&process_info->notifier_lock);
>>>     mmu_interval_set_seq(mni, cur_seq);
>>>     discard_invalid_ranges()
>>>       interval_tree_iter_first/next()
>>>         range_info->valid = false; // clear flag
>>>     mutex_unlock(&process_info->notifier_lock);
>>>
>>> Both implementations:
>>> - Acquire notifier_lock FIRST, before any flag changes
>>> - Call mmu_interval_set_seq() under the lock
>>> - Use an interval tree to find affected ranges within the wide
>>>   notifier
>>> - Mark the per-range flag as invalid/valid under the lock
>>>
>>> The page fault path and final validation path also follow the same
>>> pattern as DRM GPU SVM: fault outside the lock, set/check the
>>> per-range flag under the lock.
>>>
>>> Regards,
>>> Honglei
>>>
>>>
>>> On 2026/2/6 21:56, Christian König wrote:
>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>> From: Honglei Huang <[email protected]>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This is v3 of the patch series to support allocating multiple
>>>>> non-contiguous CPU virtual address ranges that map to a single
>>>>> contiguous GPU virtual address.
>>>>>
>>>>> v3:
>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>    - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>
>>>> That is most likely not the best approach, but Felix or Philip need
>>>> to comment here since I don't know such IOCTLs well either.
>>>>
>>>>>    - When flag is set, the mmap_offset field points to the range array
>>>>>    - Minimal API surface change
>>>>
>>>> Why a range of VA space for each entry?
>>>>
>>>>> 2. Improved MMU notifier handling:
>>>>>    - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>    - Interval tree for efficient lookup of affected ranges during
>>>>>      invalidation
>>>>>    - Avoids per-range notifier overhead mentioned in v2 review
>>>>
>>>> That won't work unless you also modify hmm_range_fault() to take
>>>> multiple VA addresses (or ranges) at the same time.
>>>>
>>>> The problem is that we must rely on hmm_range.notifier_seq to detect
>>>> changes to the page tables in question, but that in turn works only
>>>> if you have one hmm_range structure and not multiple.
>>>>
>>>> What might work is doing an XOR or CRC over all
>>>> hmm_range.notifier_seq values you have, but that is a bit flaky.
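>>>>
>>>> For reference, the canonical single-range pattern from
>>>> Documentation/mm/hmm.rst looks roughly like this (a sketch, error
>>>> handling trimmed; take_lock/release_lock stand in for the driver's
>>>> own update lock, as in the doc). Note that the one notifier_seq
>>>> covers exactly one contiguous [start, end) span:
>>>>
>>>>   struct hmm_range range = {
>>>>           .notifier = &interval_sub,
>>>>           .start = start,
>>>>           .end = end,
>>>>           .hmm_pfns = pfns,
>>>>   };
>>>>
>>>> again:
>>>>   range.notifier_seq = mmu_interval_read_begin(&interval_sub);
>>>>   mmap_read_lock(mm);
>>>>   ret = hmm_range_fault(&range);
>>>>   mmap_read_unlock(mm);
>>>>   if (ret == -EBUSY)
>>>>           goto again;        /* raced with an invalidation */
>>>>
>>>>   take_lock(driver->update);
>>>>   if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
>>>>           release_lock(driver->update);
>>>>           goto again;        /* seq changed, pfns are stale */
>>>>   }
>>>>   /* commit pfns to the device page table */
>>>>   release_lock(driver->update);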
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> 3. Better code organization: Split into 8 focused patches for
>>>>>    easier review
>>>>>
>>>>> v2:
>>>>> - Each CPU VA range gets its own mmu_interval_notifier for
>>>>>   invalidation
>>>>> - All ranges validated together and mapped to a contiguous GPU VA
>>>>> - Single kgd_mem object with an array of user_range_info structures
>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>
>>>>> Current Implementation Approach
>>>>> ===============================
>>>>>
>>>>> This series implements a practical solution within existing kernel
>>>>> constraints:
>>>>>
>>>>> 1. Single MMU notifier for VA span: Register one notifier covering
>>>>>    the entire range from lowest to highest address in the batch
>>>>>
>>>>> 2. Interval tree filtering: Use an interval tree to efficiently
>>>>>    identify which specific ranges are affected during invalidation
>>>>>    callbacks, avoiding unnecessary processing for unrelated address
>>>>>    changes (see the sketch after this list)
>>>>>
>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction
>>>>>    and restore paths, maintaining consistency with existing userptr
>>>>>    handling
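>>>>>
>>>>> To illustrate point 2, the invalidation callback filters with the
>>>>> kernel interval tree (<linux/interval_tree.h>) roughly as below.
>>>>> This is a sketch; the container and field names are illustrative,
>>>>> not the exact patch code:
>>>>>
>>>>>   struct user_range_info {
>>>>>           struct interval_tree_node it_node;  /* [start, last], inclusive */
>>>>>           bool valid;
>>>>>   };
>>>>>
>>>>>   /* Called under process_info->notifier_lock for the CPU VA span
>>>>>    * [start, last] being invalidated: only ranges that actually
>>>>>    * overlap get their flag cleared, so an unmap of an unrelated
>>>>>    * hole inside the wide notifier touches nothing.
>>>>>    */
>>>>>   static void mark_invalid_ranges(struct rb_root_cached *root,
>>>>>                                   unsigned long start,
>>>>>                                   unsigned long last)
>>>>>   {
>>>>>           struct interval_tree_node *node;
>>>>>
>>>>>           for (node = interval_tree_iter_first(root, start, last);
>>>>>                node;
>>>>>                node = interval_tree_iter_next(node, start, last)) {
>>>>>                   struct user_range_info *info =
>>>>>                           container_of(node, struct user_range_info,
>>>>>                                        it_node);
>>>>>
>>>>>                   info->valid = false;
>>>>>           }
>>>>>   }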
>>>>>
>>>>> Patch Series Overview
>>>>> =====================
>>>>>
>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>   - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>   - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data
>>>>>     structures
>>>>>
>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>   - user_range_info structure for per-range tracking
>>>>>   - Fields for batch allocation in kgd_mem
>>>>>
>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>   - Interval tree for efficient range lookup during invalidation
>>>>>   - mark_invalid_ranges() function
>>>>>
>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>   - Single notifier for entire VA span
>>>>>   - Invalidation callback using interval tree filtering
>>>>>
>>>>> Patch 5/8: Implement batch userptr page management
>>>>>   - get_user_pages_batch() and set_user_pages_batch()
>>>>>   - Per-range page array management
>>>>>
>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>   - init_user_pages_batch() main initialization
>>>>>   - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>
>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>   - Shared eviction/restore handling for batch allocations
>>>>>   - Integration with existing userptr validation flows
>>>>>
>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>   - Input validation and range array parsing
>>>>>   - Integration with existing alloc_memory_of_gpu path
>>>>>
>>>>> Testing
>>>>> =======
>>>>>
>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>> - Various allocation sizes (4KB to 1GB+ per range)
>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>> - OpenCL CTS and HIP catch tests in a KVM guest environment
>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>> - Small LLM inference (3B-7B models)
>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>> - Performance improvement: 2x-2.4x faster than the userspace approach
>>>>>
>>>>> Thank you for your review and feedback.
>>>>>
>>>>> Best regards,
>>>>> Honglei Huang
>>>>>
>>>>> Honglei Huang (8):
>>>>>   drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>   drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>   drm/amdkfd: Implement interval tree for userptr ranges
>>>>>   drm/amdkfd: Add batch MMU notifier support
>>>>>   drm/amdkfd: Implement batch userptr page management
>>>>>   drm/amdkfd: Add batch allocation function and export API
>>>>>   drm/amdkfd: Unify userptr cleanup and update paths
>>>>>   drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>>>>>  include/uapi/linux/kfd_ioctl.h                |  31 +-
>>>>>  4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>
>>>>
>>>
>>
>