Hi Honglei,

I have to agree with Felix. Adding such complexity to the KFD API is a clear 
no-go from my side.

Just skimming over the patch, it's obvious that this isn't correctly
implemented. You simply can't use the MMU notifier ranges like this.

Regards,
Christian. 

On 1/9/26 08:55, Honglei Huang wrote:
> 
> Hi Felix,
> 
> Thank you for the feedback. I understand your concern about API maintenance.
> 
> From what I can see, KFD is still the core driver for all GPU compute 
> workloads. The entire compute ecosystem is built on KFD's infrastructure and 
> continues to rely on it. While the unification work is ongoing, any 
> transition to DRM render node APIs would naturally take considerable time, 
> and KFD is expected to remain the primary interface for compute for the 
> foreseeable future. Meanwhile, the lack of batch allocation is hurting 
> performance in specific compute scenarios, particularly virtualized 
> environments (see the numbers in the quoted cover letter below).
> 
> You're absolutely right about the API proliferation concern. Based on your 
> feedback, I'd like to revise the approach for v3 to minimize impact by 
> reusing the existing ioctl instead of adding a new API:
> 
> - Reuse existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl
> - Add one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
> - When flag is set, mmap_offset field points to range array
> - No new ioctl command, no new structure
> 
> This changes the API surface from adding a new ioctl to adding just one flag.
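> 
> To make this concrete, here is a rough sketch of how userspace might invoke 
> the revised ioctl. The kfd_ioctl_userptr_range field names and how the range 
> count would be conveyed are my assumptions, not final UAPI:
> 
>   #include <stdint.h>
>   #include <sys/ioctl.h>
>   #include <linux/kfd_ioctl.h>
> 
>   static int alloc_scattered_to_gpu(int kfd_fd, uint32_t gpu_id,
>                                     uint64_t gpu_va, void *buf_a, void *buf_b)
>   {
>           /* Assumed layout of the proposed range descriptor. */
>           struct kfd_ioctl_userptr_range ranges[2] = {
>                   { .start = (uint64_t)(uintptr_t)buf_a, .size = 2UL << 20 },
>                   { .start = (uint64_t)(uintptr_t)buf_b, .size = 2UL << 20 },
>           };
>           struct kfd_ioctl_alloc_memory_of_gpu_args args = {
>                   .va_addr     = gpu_va,      /* one contiguous GPU VA */
>                   .size        = 4UL << 20,   /* sum of the range sizes */
>                   .gpu_id      = gpu_id,
>                   .flags       = KFD_IOC_ALLOC_MEM_FLAGS_USERPTR |
>                                  KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH,
>                   /* With the batch flag set, mmap_offset carries a pointer
>                    * to the range array instead of a CPU VA. */
>                   .mmap_offset = (uint64_t)(uintptr_t)ranges,
>           };
> 
>           return ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &args);
>   }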
> 
> Note that the core implementation already modifies the shared GPU memory 
> management infrastructure in amdgpu_amdkfd_gpuvm.c. If the DRM render node 
> path needs similar functionality later, these functions could be reused 
> directly.
> 
> Would you be willing to review v3 with this approach?
> 
> Regards,
> Honglei Huang
> 
> On 2026/1/9 03:46, Felix Kuehling wrote:
>> I don't have time to review this in detail right now. I am concerned about 
>> adding new KFD API, when the trend is moving towards DRM render node APIs. 
>> This creates additional burden for ongoing support of these APIs in addition 
>> to the inevitable DRM render node duplicates we'll have in the future. Would 
>> it be possible to implement this batch userptr allocation in a render node 
>> API from the start?
>>
>> Regards,
>>    Felix
>>
>>
>> On 2026-01-04 02:21, Honglei Huang wrote:
>>> From: Honglei Huang <[email protected]>
>>>
>>> Hi all,
>>>
>>> This is v2 of the patch series to support allocating multiple
>>> non-contiguous CPU virtual address ranges that map to a single
>>> contiguous GPU virtual address.
>>>
>>> **Key improvements over v1:**
>>> - NO memory pinning: uses HMM for page tracking, pages can be swapped/migrated
>>> - NO impact on SVM subsystem: avoids complexity during KFD/KGD unification
>>> - Better approach: userptr's VA remapping design is ideal for scattered VA registration
>>>
>>> Based on community feedback, v2 takes a completely different implementation
>>> approach by leveraging the existing userptr infrastructure rather than
>>> introducing new SVM-based mechanisms that required memory pinning.
>>>
>>> Changes from v1
>>> ===============
>>>
>>> v1 attempted to solve this problem through the SVM subsystem by:
>>> - Adding a new AMDKFD_IOC_SVM_RANGES ioctl for batch SVM range registration
>>> - Introducing KFD_IOCTL_SVM_ATTR_MAPPED attribute for special VMA handling
>>> - Using pin_user_pages_fast() to pin scattered memory ranges
>>> - Registering multiple SVM ranges with pinned pages
>>>
>>> This approach had significant drawbacks:
>>> 1. Memory pinning defeated the purpose of HMM-based SVM's on-demand paging
>>> 2. Added complexity to the SVM subsystem
>>> 3. Prevented memory oversubscription and dynamic migration
>>> 4. Could cause memory pressure due to locked pages
>>> 5. Interfered with NUMA optimization and page migration
>>>
>>> v2 Implementation Approach
>>> ==========================
>>>
>>> 1. **No memory pinning required**
>>>     - Uses HMM (Heterogeneous Memory Management) for page tracking
>>>     - Pages are NOT pinned, can be swapped/migrated when not in use
>>>     - Supports dynamic page eviction and on-demand restore like standard userptr
>>>
>>> 2. **Zero impact on KFD SVM subsystem**
>>>     - Extends ALLOC_MEMORY_OF_GPU path, not SVM
>>>     - New ioctl: AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH
>>>     - Zero changes to SVM code, limited scope of changes
>>>
>>> 3. **Perfect fit for non-contiguous VA registration**
>>>     - Userptr design naturally supports GPU VA != CPU VA mapping
>>>     - Multiple non-contiguous CPU VA ranges -> single contiguous GPU VA
>>>     - Unlike KFD SVM, which maintains VA identity, userptr allows remapping.
>>>       This VA remapping capability makes userptr ideal for scattered allocations.
>>>
>>> **Implementation Details:**
>>>     - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>     - All ranges validated together and mapped to contiguous GPU VA
>>>     - Single kgd_mem object with array of user_range_info structures
>>>     - Unified eviction/restore path for all ranges in a batch
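>>>
>>> For illustration, the per-range tracking looks roughly like this
>>> (simplified sketch; field names abridged, see patch 2/4 for the real
>>> structure):
>>>
>>>     struct user_range_info {
>>>             u64 cpu_addr;   /* start of this CPU VA range */
>>>             u64 size;       /* range length, page aligned */
>>>             /* Per-range notifier (from <linux/mmu_notifier.h>): an
>>>              * invalidation on any range marks the whole batch for
>>>              * revalidation. */
>>>             struct mmu_interval_notifier notifier;
>>>     };
>>>
>>>     /* kgd_mem carries an array of these; all ranges are validated
>>>      * together and mapped back-to-back into one contiguous GPU VA. */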
>>>
>>> Patch Series Overview
>>> =====================
>>>
>>> Patch 1/4: Add AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH ioctl and data structures
>>>      - New ioctl command and kfd_ioctl_userptr_range structure
>>>      - UAPI for userspace to request batch userptr allocation
>>>
>>> Patch 2/4: Extend kgd_mem for batch userptr support
>>>      - Add user_range_info and associated fields to kgd_mem
>>>      - Data structures for tracking multiple ranges per allocation
>>>
>>> Patch 3/4: Implement batch userptr allocation and management
>>>      - Core functions: init_user_pages_batch(), get_user_pages_batch()
>>>      - Per-range eviction/restore handlers with unified management
>>>      - Integration with existing userptr eviction/validation flows
>>>
>>> Patch 4/4: Wire up batch userptr ioctl handler
>>>      - Ioctl handler with input validation
>>>      - SVM conflict checking for GPU VA and CPU VA ranges
>>>      - Integration with kfd_process and process_device infrastructure
>>>
>>> Performance Comparison
>>> ======================
>>>
>>> Before implementing this series, we attempted a userspace solution that made
>>> multiple calls to the existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl to
>>> register non-contiguous VA ranges individually. This approach resulted in
>>> severe performance degradation:
>>>
>>> **Userspace Multiple ioctl Approach:**
>>> - Benchmark score: ~80,000 (down from 200,000 on bare metal)
>>> - Performance loss: 60% degradation
>>>
>>> **This Kernel Batch ioctl Approach:**
>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>> - Achieves near-native performance in virtualized environments
>>>
>>> The batch registration in kernel avoids the repeated syscall overhead and
>>> enables efficient unified management of scattered VA ranges, recovering most
>>> of the performance lost to virtualization.
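>>>
>>> Schematically, the two approaches compare as follows (pseudocode, not
>>> the actual libhsakmt code; helper names are illustrative):
>>>
>>>     /* Userspace workaround: one ioctl per scattered range, i.e.
>>>      * thousands of syscalls plus per-range bookkeeping. */
>>>     for (i = 0; i < nranges; i++)
>>>             alloc_one_userptr(kfd_fd, gpu_va + offset[i], &ranges[i]);
>>>
>>>     /* This series: a single batch ioctl registers all ranges under
>>>      * one kgd_mem object with unified eviction/restore. */
>>>     alloc_userptr_batch(kfd_fd, gpu_va, ranges, nranges);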
>>>
>>> Testing Results
>>> ===============
>>>
>>> The series has been tested with:
>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>> - Various allocation sizes (4KB to 1GB+ per range)
>>> - GPU compute workloads using the batch-allocated ranges
>>> - Memory pressure scenarios and eviction/restore cycles
>>> - OpenCL CTS in KVM guest environment
>>> - HIP catch tests in KVM guest environment
>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>> - Small LLM inference (3B-7B models) using HuggingFace transformers
>>>
>>> Corresponding userspace patches
>>> ===============================
>>> Userspace ROCm changes for new ioctl:
>>> - libhsakmt: https://github.com/ROCm/rocm-systems/commit/ac21716e5d6f68ec524e50eeef10d1d6ad7eae86
>>>
>>> Thank you for your review; I look forward to your feedback.
>>>
>>> Best regards,
>>> Honglei Huang
>>>
>>> Honglei Huang (4):
>>>    drm/amdkfd: Add batch userptr allocation UAPI
>>>    drm/amdkfd: Extend kgd_mem for batch userptr support
>>>    drm/amdkfd: Implement batch userptr allocation and management
>>>    drm/amdkfd: Wire up batch userptr ioctl handler
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  21 +
>>>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 543 +++++++++++++++++-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 159 +++++
>>>   include/uapi/linux/kfd_ioctl.h                |  37 +-
>>>   4 files changed, 740 insertions(+), 20 deletions(-)
>>>
> 
