Hi Honglei,

I have to agree with Felix. Adding such complexity to the KFD API is a clear no-go from my side.
Just skimming over the patch it's obvious that this isn't correctly implemented. You simply can't use the MMU notifier ranges like this.

Regards,
Christian.

On 1/9/26 08:55, Honglei Huang wrote:
> Hi Felix,
>
> Thank you for the feedback. I understand your concern about API maintenance.
>
> From what I can see, KFD is still the core driver for all GPU compute
> workloads. The entire compute ecosystem is built on KFD's infrastructure and
> continues to rely on it. While the unification work is ongoing, any
> transition to DRM render node APIs would naturally take considerable time,
> and KFD is expected to remain the primary interface for compute for the
> foreseeable future. This batch allocation issue affects performance in some
> specific compute scenarios.
>
> You're absolutely right about the API proliferation concern. Based on your
> feedback, I'd like to revise the approach for v3 to minimize impact by
> reusing the existing ioctl instead of adding a new API:
>
> - Reuse the existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl
> - Add one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
> - When the flag is set, the mmap_offset field points to a range array
> - No new ioctl command, no new structure
>
> This changes the API surface from adding a new ioctl to adding just one
> flag; a rough usage sketch follows below.
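>
> For illustration only: the batch flag is a proposal and how the range count
> would be conveyed is still open, so the flag, the range descriptor fields,
> and this helper are assumptions rather than final UAPI (it will not compile
> against current headers without the proposed flag). The ioctl and args
> structure are the ones already in include/uapi/linux/kfd_ioctl.h.
>
>   #include <stdint.h>
>   #include <sys/ioctl.h>
>   #include <linux/kfd_ioctl.h>
>
>   /* Proposed range descriptor; fields are illustrative */
>   struct kfd_ioctl_userptr_range {
>           uint64_t start;   /* CPU VA of one scattered range */
>           uint64_t size;    /* length of that range in bytes */
>   };
>
>   static int alloc_batch_userptr(int kfd_fd, uint32_t gpu_id,
>                                  uint64_t gpu_va, uint64_t total_size,
>                                  struct kfd_ioctl_userptr_range *ranges,
>                                  uint64_t *handle)
>   {
>           struct kfd_ioctl_alloc_memory_of_gpu_args args = {0};
>
>           args.va_addr = gpu_va;      /* single contiguous GPU VA */
>           args.size    = total_size;  /* sum of all range sizes */
>           args.gpu_id  = gpu_id;
>           args.flags   = KFD_IOC_ALLOC_MEM_FLAGS_USERPTR |
>                          KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH; /* proposed */
>           /* With the batch flag set, mmap_offset carries a pointer to the
>            * range array instead of a single user pointer.
>            */
>           args.mmap_offset = (uint64_t)(uintptr_t)ranges;
>
>           if (ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &args))
>                   return -1;
>
>           *handle = args.handle;
>           return 0;
>   }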
>
> Actually, the implementation modifies DRM's GPU memory management
> infrastructure in amdgpu_amdkfd_gpuvm.c. If a DRM render node needs similar
> functionality later, these functions could be reused directly.
>
> Would you be willing to review v3 with this approach?
>
> Regards,
> Honglei Huang
>
> On 2026/1/9 03:46, Felix Kuehling wrote:
>> I don't have time to review this in detail right now. I am concerned about
>> adding new KFD API, when the trend is moving towards DRM render node APIs.
>> This creates additional burden for ongoing support of these APIs in addition
>> to the inevitable DRM render node duplicates we'll have in the future. Would
>> it be possible to implement this batch userptr allocation in a render node
>> API from the start?
>>
>> Regards,
>> Felix
>>
>>
>> On 2026-01-04 02:21, Honglei Huang wrote:
>>> From: Honglei Huang <[email protected]>
>>>
>>> Hi all,
>>>
>>> This is v2 of the patch series to support allocating multiple
>>> non-contiguous CPU virtual address ranges that map to a single contiguous
>>> GPU virtual address.
>>>
>>> **Key improvements over v1:**
>>> - NO memory pinning: uses HMM for page tracking, pages can be
>>>   swapped/migrated
>>> - NO impact on the SVM subsystem: avoids complexity during KFD/KGD
>>>   unification
>>> - Better approach: userptr's VA remapping design is ideal for scattered
>>>   VA registration
>>>
>>> Based on community feedback, v2 takes a completely different implementation
>>> approach by leveraging the existing userptr infrastructure rather than
>>> introducing new SVM-based mechanisms that required memory pinning.
>>>
>>> Changes from v1
>>> ===============
>>>
>>> v1 attempted to solve this problem through the SVM subsystem by:
>>> - Adding a new AMDKFD_IOC_SVM_RANGES ioctl for batch SVM range registration
>>> - Introducing a KFD_IOCTL_SVM_ATTR_MAPPED attribute for special VMA handling
>>> - Using pin_user_pages_fast() to pin scattered memory ranges
>>> - Registering multiple SVM ranges with pinned pages
>>>
>>> This approach had significant drawbacks:
>>> 1. Memory pinning defeated the purpose of HMM-based SVM's on-demand paging
>>> 2. Added complexity to the SVM subsystem
>>> 3. Prevented memory oversubscription and dynamic migration
>>> 4. Could cause memory pressure due to locked pages
>>> 5. Interfered with NUMA optimization and page migration
>>>
>>> v2 Implementation Approach
>>> ==========================
>>>
>>> 1. **No memory pinning required**
>>>    - Uses HMM (Heterogeneous Memory Management) for page tracking
>>>    - Pages are NOT pinned and can be swapped/migrated when not in use
>>>    - Supports dynamic page eviction and on-demand restore like standard
>>>      userptr
>>>
>>> 2. **Zero impact on the KFD SVM subsystem**
>>>    - Extends the ALLOC_MEMORY_OF_GPU path, not SVM
>>>    - New ioctl: AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH
>>>    - Zero changes to SVM code, limited scope of changes
>>>
>>> 3. **Perfect fit for non-contiguous VA registration**
>>>    - The userptr design naturally supports GPU VA != CPU VA mapping
>>>    - Multiple non-contiguous CPU VA ranges -> single contiguous GPU VA
>>>    - Unlike KFD SVM, which maintains VA identity, userptr allows remapping;
>>>      this VA remapping capability makes userptr ideal for scattered
>>>      allocations
>>>
>>> **Implementation Details:**
>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>   (see the sketch below)
>>> - All ranges are validated together and mapped to a contiguous GPU VA
>>> - A single kgd_mem object holds an array of user_range_info structures
>>> - Unified eviction/restore path for all ranges in a batch
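>>>
>>> As a rough illustration of the per-range notifier idea (not the literal
>>> patch code; the user_range_info fields shown here are assumptions, the
>>> real definition is in patch 2/4), the registration loop looks roughly
>>> like this:
>>>
>>>   #include <linux/types.h>
>>>   #include <linux/sched.h>
>>>   #include <linux/mmu_notifier.h>
>>>
>>>   /* Illustrative per-range tracking state */
>>>   struct user_range_info {
>>>           struct mmu_interval_notifier notifier; /* per-range invalidation */
>>>           u64 start;                             /* CPU VA of the range */
>>>           u64 size;                              /* range length in bytes */
>>>   };
>>>
>>>   /* Register one interval notifier per scattered CPU VA range, unwinding
>>>    * the already-registered ones on failure.
>>>    */
>>>   static int register_range_notifiers(struct user_range_info *ranges,
>>>                                       unsigned int nranges,
>>>                                       const struct mmu_interval_notifier_ops *ops)
>>>   {
>>>           unsigned int i;
>>>           int r;
>>>
>>>           for (i = 0; i < nranges; i++) {
>>>                   r = mmu_interval_notifier_insert(&ranges[i].notifier,
>>>                                                    current->mm,
>>>                                                    ranges[i].start,
>>>                                                    ranges[i].size, ops);
>>>                   if (r)
>>>                           goto unwind;
>>>           }
>>>           return 0;
>>>
>>>   unwind:
>>>           while (i--)
>>>                   mmu_interval_notifier_remove(&ranges[i].notifier);
>>>           return r;
>>>   }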
>>>
>>> Patch Series Overview
>>> =====================
>>>
>>> Patch 1/4: Add AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH ioctl and data
>>> structures
>>> - New ioctl command and kfd_ioctl_userptr_range structure
>>> - UAPI for userspace to request batch userptr allocation
>>>
>>> Patch 2/4: Extend kgd_mem for batch userptr support
>>> - Add user_range_info and associated fields to kgd_mem
>>> - Data structures for tracking multiple ranges per allocation
>>>
>>> Patch 3/4: Implement batch userptr allocation and management
>>> - Core functions: init_user_pages_batch(), get_user_pages_batch()
>>> - Per-range eviction/restore handlers with unified management
>>> - Integration with existing userptr eviction/validation flows
>>>
>>> Patch 4/4: Wire up batch userptr ioctl handler
>>> - Ioctl handler with input validation
>>> - SVM conflict checking for GPU VA and CPU VA ranges
>>> - Integration with kfd_process and process_device infrastructure
>>>
>>> Performance Comparison
>>> ======================
>>>
>>> Before implementing this patch, we attempted a userspace solution that
>>> makes multiple calls to the existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl
>>> to register non-contiguous VA ranges individually. This approach resulted
>>> in severe performance degradation:
>>>
>>> **Userspace Multiple-ioctl Approach:**
>>> - Benchmark score: ~80,000 (down from 200,000 on bare metal)
>>> - Performance loss: 60% degradation
>>>
>>> **This Kernel Batch-ioctl Approach:**
>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>> - Performance improvement: 2x-2.4x faster than the userspace approach
>>> - Achieves near-native performance in virtualized environments
>>>
>>> Batch registration in the kernel avoids the repeated syscall overhead and
>>> enables efficient unified management of scattered VA ranges, recovering
>>> most of the performance lost to virtualization.
>>>
>>> Testing Results
>>> ===============
>>>
>>> The series has been tested with:
>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>> - Various allocation sizes (4KB to 1GB+ per range)
>>> - GPU compute workloads using the batch-allocated ranges
>>> - Memory pressure scenarios and eviction/restore cycles
>>> - OpenCL CTS in a KVM guest environment
>>> - HIP catch tests in a KVM guest environment
>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>> - Small LLM inference (3B-7B models) using HuggingFace transformers
>>>
>>> Corresponding userspace patch
>>> =============================
>>> Userspace ROCm changes for the new ioctl:
>>> - libhsakmt: https://github.com/ROCm/rocm-systems/commit/ac21716e5d6f68ec524e50eeef10d1d6ad7eae86
>>>
>>> Thank you for your review; looking forward to your feedback.
>>>
>>> Best regards,
>>> Honglei Huang
>>>
>>> Honglei Huang (4):
>>>   drm/amdkfd: Add batch userptr allocation UAPI
>>>   drm/amdkfd: Extend kgd_mem for batch userptr support
>>>   drm/amdkfd: Implement batch userptr allocation and management
>>>   drm/amdkfd: Wire up batch userptr ioctl handler
>>>
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h   |  21 +
>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 543 +++++++++++++++++-
>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c     | 159 +++++
>>>  include/uapi/linux/kfd_ioctl.h               |  37 +-
>>>  4 files changed, 740 insertions(+), 20 deletions(-)
>>>
