From: Honglei Huang <[email protected]>
Hi all,
This is v2 of the patch series to support allocating multiple non-contiguous
CPU virtual address ranges that map to a single contiguous GPU virtual address.
**Key improvements over v1:**
- NO memory pinning: uses HMM for page tracking, pages can be swapped/migrated
- NO impact on SVM subsystem: avoids complexity during KFD/KGD unification
- Better approach: userptr's VA remapping design is ideal for scattered VA
registration
Based on community feedback, v2 takes a completely different implementation
approach by leveraging the existing userptr infrastructure rather than
introducing new SVM-based mechanisms that required memory pinning.
Changes from v1
===============
v1 attempted to solve this problem through the SVM subsystem by:
- Adding a new AMDKFD_IOC_SVM_RANGES ioctl for batch SVM range registration
- Introducing KFD_IOCTL_SVM_ATTR_MAPPED attribute for special VMA handling
- Using pin_user_pages_fast() to pin scattered memory ranges
- Registering multiple SVM ranges with pinned pages
This approach had significant drawbacks:
1. Memory pinning defeated the purpose of HMM-based SVM's on-demand paging
2. Added complexity to the SVM subsystem
3. Prevented memory oversubscription and dynamic migration
4. Could cause memory pressure due to locked pages
5. Interfered with NUMA optimization and page migration
v2 Implementation Approach
==========================
1. **No memory pinning required**
- Uses HMM (Heterogeneous Memory Management) for page tracking
- Pages are NOT pinned, can be swapped/migrated when not in use
- Supports dynamic page eviction and on-demand restore like standard userptr
2. **Zero impact on KFD SVM subsystem**
- Extends ALLOC_MEMORY_OF_GPU path, not SVM
- New ioctl: AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH
- No changes to SVM code; the overall scope of changes stays small
3. **Perfect fit for non-contiguous VA registration**
- Userptr design naturally supports GPU VA != CPU VA mapping
- Multiple non-contiguous CPU VA ranges -> single contiguous GPU VA
- Unlike KFD SVM, which maintains CPU/GPU VA identity, userptr allows remapping;
this VA remapping capability makes userptr ideal for scattered allocations.
**Implementation Details:**
- Each CPU VA range gets its own mmu_interval_notifier for invalidation
- All ranges validated together and mapped to contiguous GPU VA
- Single kgd_mem object with array of user_range_info structures
- Unified eviction/restore path for all ranges in a batch
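To make the data layout concrete, here is a heavily simplified sketch of how
the per-range state described above fits together. Only user_range_info,
kgd_mem and mmu_interval_notifier are names used by the series; the field
layout below is illustrative, and the authoritative definitions are in
patches 2/4 and 3/4.

/* Simplified sketch only -- see patches 2/4 and 3/4 for the real code. */
struct user_range_info {
	u64 cpu_va;				/* start of one scattered CPU VA range */
	u64 size;				/* length of that range in bytes */
	struct mmu_interval_notifier notifier;	/* per-range invalidation callback */
	struct hmm_range *hmm_range;		/* HMM page tracking, nothing is pinned */
};

/* Conceptually, kgd_mem (extended in patch 2/4) gains fields like these: */
struct kgd_mem_batch_sketch {
	struct user_range_info *user_ranges;	/* one entry per CPU VA range */
	u32 num_user_ranges;			/* ranges backing one contiguous GPU VA */
};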
Patch Series Overview
=====================
Patch 1/4: Add AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH ioctl and data structures
- New ioctl command and kfd_ioctl_userptr_range structure
- UAPI for userspace to request batch userptr allocation (an illustrative
  sketch follows this overview)
Patch 2/4: Extend kgd_mem for batch userptr support
- Add user_range_info and associated fields to kgd_mem
- Data structures for tracking multiple ranges per allocation
Patch 3/4: Implement batch userptr allocation and management
- Core functions: init_user_pages_batch(), get_user_pages_batch()
- Per-range eviction/restore handlers with unified management
- Integration with existing userptr eviction/validation flows
Patch 4/4: Wire up batch userptr ioctl handler
- Ioctl handler with input validation
- SVM conflict checking for GPU VA and CPU VA ranges
- Integration with kfd_process and process_device infrastructure
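For reviewers who want a quick feel for the proposed UAPI before opening the
patches, a simplified sketch follows. AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH and
struct kfd_ioctl_userptr_range are the names used by the series; the argument
struct layout shown here is illustrative only, and patch 1/4
(include/uapi/linux/kfd_ioctl.h) is authoritative.

/* Illustrative UAPI sketch -- not the literal definitions from patch 1/4. */
struct kfd_ioctl_userptr_range {
	__u64 start_addr;	/* CPU VA of one scattered range (page aligned) */
	__u64 size;		/* length of the range in bytes (page aligned) */
};

struct kfd_ioctl_alloc_memory_of_gpu_batch_args {
	__u64 va_addr;		/* single contiguous GPU VA for the whole batch */
	__u64 size;		/* total size, i.e. the sum of all range sizes */
	__u64 ranges_ptr;	/* user pointer to a kfd_ioctl_userptr_range array */
	__u32 num_ranges;	/* number of scattered CPU VA ranges */
	__u32 gpu_id;
	__u64 flags;		/* userptr flags, as for the non-batch ioctl */
	__u64 handle;		/* out: buffer handle for the existing map/free ioctls */
};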
Performance Comparison
======================
Before implementing this series, we tried a userspace-only solution that made
multiple calls to the existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl to
register each non-contiguous VA range individually. That approach resulted in
severe performance degradation:
**Userspace Multiple ioctl Approach:**
- Benchmark score: ~80,000 (down from 200,000 on bare metal)
- Performance loss: ~60% relative to bare metal
**This Kernel Batch ioctl Approach:**
- Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
- Performance improvement: 2x-2.4x faster than the userspace approach
- Achieves near-native performance in virtualized environments
Batch registration in the kernel avoids the repeated syscall overhead and
enables unified management of the scattered VA ranges, recovering most of the
performance lost to virtualization.
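In userspace terms, the difference between the two approaches looks roughly as
follows. This is a simplified illustration, not the actual libhsakmt change
(the linked rocm-systems commit below is authoritative), and it reuses the
illustrative batch args struct sketched in the overview above.

/* Simplified illustration; needs the new kfd_ioctl.h from patch 1/4. */
#include <stdint.h>
#include <sys/ioctl.h>

static int register_scattered_ranges(int kfd_fd, __u32 gpu_id, __u64 gpu_va,
				     struct kfd_ioctl_userptr_range *ranges,
				     __u32 num_ranges, __u64 total_size)
{
	/*
	 * Old approach: one AMDKFD_IOC_ALLOC_MEMORY_OF_GPU call (plus its own
	 * map/validate work) for every scattered CPU VA range -- thousands of
	 * ioctls for a large allocation.
	 *
	 * New approach: a single ioctl registers all ranges behind one
	 * contiguous GPU VA.
	 */
	struct kfd_ioctl_alloc_memory_of_gpu_batch_args args = {
		.va_addr    = gpu_va,
		.size       = total_size,
		.ranges_ptr = (__u64)(uintptr_t)ranges,
		.num_ranges = num_ranges,
		.gpu_id     = gpu_id,
	};

	return ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH, &args);
}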
Testing Results
===============
The series has been tested with:
- Multiple scattered malloc() allocations (2 to 4000+ ranges)
- Various allocation sizes (4 KB to 1 GB+ per range)
- GPU compute workloads using the batch-allocated ranges
- Memory pressure scenarios and eviction/restore cycles
- OpenCL CTS in KVM guest environment
- HIP catch tests in KVM guest environment
- AI workloads: Stable Diffusion, ComfyUI in virtualized environments
- Small LLM inference (3B-7B models) using HuggingFace transformers
Corresponding userspace patch
=============================
Userspace ROCm changes for new ioctl:
- libhsakmt:
https://github.com/ROCm/rocm-systems/commit/ac21716e5d6f68ec524e50eeef10d1d6ad7eae86
Thank you for your review; looking forward to your feedback.
Best regards,
Honglei Huang
Honglei Huang (4):
drm/amdkfd: Add batch userptr allocation UAPI
drm/amdkfd: Extend kgd_mem for batch userptr support
drm/amdkfd: Implement batch userptr allocation and management
drm/amdkfd: Wire up batch userptr ioctl handler
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 21 +
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 543 +++++++++++++++++-
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 159 +++++
include/uapi/linux/kfd_ioctl.h | 37 +-
4 files changed, 740 insertions(+), 20 deletions(-)
--
2.34.1