On 2/6/26 07:25, Honglei Huang wrote:
> From: Honglei Huang <[email protected]>
>
> Hi all,
>
> This is v3 of the patch series to support allocating multiple non-contiguous
> CPU virtual address ranges that map to a single contiguous GPU virtual
> address.
>
> v3:
> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>    - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH

That is most likely not the best approach, but Felix or Philip need to
comment here since I don't know such IOCTLs well either.

>    - When flag is set, mmap_offset field points to range array
>    - Minimal API surface change

Why range of VA space for each entry?

> 2. Improved MMU notifier handling:
>    - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>    - Interval tree for efficient lookup of affected ranges during invalidation
>    - Avoids per-range notifier overhead mentioned in v2 review

That won't work unless you also modify hmm_range_fault() to take multiple
VA addresses (or ranges) at the same time.

The problem is that we must rely on hmm_range.notifier_seq to detect
changes to the page tables in question, but that in turn works only if you
have one hmm_range structure and not multiple.

What might work is doing an XOR or CRC over all hmm_range.notifier_seq you
have, but that is a bit flaky.

Regards,
Christian.

>
> 3. Better code organization: Split into 8 focused patches for easier review
>
> v2:
>    - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>    - All ranges validated together and mapped to contiguous GPU VA
>    - Single kgd_mem object with array of user_range_info structures
>    - Unified eviction/restore path for all ranges in a batch
>
> Current Implementation Approach
> ===============================
>
> This series implements a practical solution within existing kernel
> constraints:
>
> 1. Single MMU notifier for VA span: Register one notifier covering the
>    entire range from lowest to highest address in the batch
>
> 2. Interval tree filtering: Use interval tree to efficiently identify
>    which specific ranges are affected during invalidation callbacks,
>    avoiding unnecessary processing for unrelated address changes
>
> 3. Unified eviction/restore: All ranges in a batch share eviction and
>    restore paths, maintaining consistency with existing userptr handling
>
> Patch Series Overview
> =====================
>
> Patch 1/8: Add userptr batch allocation UAPI structures
>    - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>    - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>
> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>    - user_range_info structure for per-range tracking
>    - Fields for batch allocation in kgd_mem
>
> Patch 3/8: Implement interval tree for userptr ranges
>    - Interval tree for efficient range lookup during invalidation
>    - mark_invalid_ranges() function
>
> Patch 4/8: Add batch MMU notifier support
>    - Single notifier for entire VA span
>    - Invalidation callback using interval tree filtering
>
> Patch 5/8: Implement batch userptr page management
>    - get_user_pages_batch() and set_user_pages_batch()
>    - Per-range page array management
>
> Patch 6/8: Add batch allocation function and export API
>    - init_user_pages_batch() main initialization
>    - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>
> Patch 7/8: Unify userptr cleanup and update paths
>    - Shared eviction/restore handling for batch allocations
>    - Integration with existing userptr validation flows
>
> Patch 8/8: Wire up batch allocation in ioctl handler
>    - Input validation and range array parsing
>    - Integration with existing alloc_memory_of_gpu path
>
> Testing
> =======
>
> - Multiple scattered malloc() allocations (2-4000+ ranges)
> - Various allocation sizes (4KB to 1G+ per range)
> - Memory pressure scenarios and eviction/restore cycles
> - OpenCL CTS and HIP catch tests in KVM guest environment
> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
> - Small LLM inference (3B-7B models)
> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
> - Performance improvement: 2x-2.4x faster than userspace approach
>
> Thank you for your review and feedback.
>
> Best regards,
> Honglei Huang
>
> Honglei Huang (8):
>   drm/amdkfd: Add userptr batch allocation UAPI structures
>   drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>   drm/amdkfd: Implement interval tree for userptr ranges
>   drm/amdkfd: Add batch MMU notifier support
>   drm/amdkfd: Implement batch userptr page management
>   drm/amdkfd: Add batch allocation function and export API
>   drm/amdkfd: Unify userptr cleanup and update paths
>   drm/amdkfd: Wire up batch allocation in ioctl handler
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>  include/uapi/linux/kfd_ioctl.h                |  31 +-
>  4 files changed, 697 insertions(+), 24 deletions(-)
>
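
To illustrate why folding several hmm_range.notifier_seq values into one
XOR is "a bit flaky", here is a rough userspace toy model in C. It is only a
sketch: struct toy_range and the toy_* helpers are hypothetical names made up
for this example, the counters merely stand in for per-range sequence
numbers, and nothing here is kernel API or reflects the real notifier_seq
semantics in detail. It only shows that a combined XOR can stay the same
even though every individual counter has changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one tracked range; NOT struct hmm_range. */
struct toy_range {
	uint64_t seq;      /* current sequence, bumped on "invalidation" */
	uint64_t snapshot; /* sequence recorded when pages were collected */
};

/* Record the current sequence of every range (the "read begin" step). */
static void toy_snapshot(struct toy_range *r, size_t n)
{
	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		r[i].snapshot = r[i].seq;
}

/* Exact check: retry if any single range changed since the snapshot. */
static bool toy_retry_per_range(const struct toy_range *r, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (r[i].seq != r[i].snapshot)
			return true;
	return false;
}

/* Folded check: compare only the XOR of all sequences (assumed scheme). */
static bool toy_retry_xor(const struct toy_range *r, size_t n)
{
	uint64_t now = 0, then = 0;

	for (size_t i = 0; i < n; i++) {
		now ^= r[i].seq;
		then ^= r[i].snapshot;
	}
	return now != then;
}
```

With two ranges snapshotted at sequences 1 and 6 and later advanced to 3 and
4, toy_retry_per_range() reports a change, while toy_retry_xor() does not,
because 1 ^ 6 and 3 ^ 4 are both 7. A CRC over the values would make such a
collision less likely, but it would still be a probabilistic check rather
than an exact one.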
