Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm

Christian König Tue, 17 Mar 2026 04:48:43 -0700

Adding a few XE and drm_gpuvm people on TO.

On 3/17/26 12:29, Honglei Huang wrote:
> From: Honglei Huang <[email protected]>
> 
> This is a POC/draft patch series of SVM feature in amdgpu based on the 
> drm_gpusvm framework. The primary purpose of this RFC is to validate
> the framework's applicability, identify implementation challenges, 
> and start discussion on framework evolution. This is not a production 
> ready submission.
> 
> This patch series implements basic SVM support with the following features:
> 
>   1) Attributes separated from physical page management:
> 
>     - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
>       tree that stores SVM attributes. Managed through the SET_ATTR
>       operation and the MMU notifier callback.
> 
>     - Physical page layer (drm_gpusvm ranges): managed by the
>       drm_gpusvm framework, representing actual HMM backed DMA
>       mappings and GPU page table entries.
> 
>      This separation is necessary:
>        -  The framework does not support range splitting, so a partial 
>           munmap destroys the entire overlapping range, including the 
>           still valid parts. If attributes were stored inside drm_gpusvm
>           ranges, they would be lost on unmapping.
>           The separate attr tree preserves userspace set attributes
>           across range operations.


Isn't that actually intended? When parts of the range unmap then that usually 
means the whole range isn't valid any more.

> 
>        -  drm_gpusvm range boundaries are determined by the fault address
>           and a preset chunk size, not by userspace attribute boundaries.
>           Ranges may be rechunked on memory changes. Embedding
>           attributes in framework ranges would scatter attr state
>           across many small ranges and require complex reassembly
>           logic when operating on attributes.

Yeah, that makes a lot of sense.

> 
>   2) System memory mapping via drm_gpusvm
> 
>      The core mapping path uses drm_gpusvm_range_find_or_insert() to
>      create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>      and DMA mapping, then updates GPU page tables via
>      amdgpu_vm_update_range().
> 
>   3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> 
>      On XNACK off hardware the GPU cannot recover from page faults,
>      so mappings must be established through ioctl. When
>      userspace calls SET_ATTR with ACCESS=ENABLE, the driver 
>      walks the attr tree and maps all accessible intervals 
>      to the GPU by amdgpu_svm_range_map_attr_ranges(). 
> 
>   4) Invalidation, GC worker, and restore worker
> 
>      MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>      three cases based on event type and hardware mode:
>        - unmap event: clear GPU PTEs in the notifier context,
>          unmap DMA pages, mark ranges as unmapped, flush TLB,
>          and enqueue to the GC worker. On XNACK off, also
>          quiesce KFD queues and schedule rebuild of the
>          still valid portions that were destroyed together with
>          the unmapped subregion.
> 
>        - evict on XNACK off:
>          quiesce KFD queues first, then unmap DMA pages and
>          enqueue to the restore worker.

Is that done through the DMA fence or by talking directly to the MES/HWS?

Thanks,
Christian.

> 
>        - evict on XNACK on:
>          clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>          not schedule any worker. The GPU will fault on next
>          access and the fault handler establishes the mapping.
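
The three notifier cases quoted above can be read as a small decision table. The following is a minimal user-space sketch of that table only, not code from the series: the enum names, the action bits, and svm_invalidate_actions() are hypothetical, and each case lists just the steps the cover letter names for it.

```c
#include <stdbool.h>

/* Hypothetical event types; the real driver inspects the MMU notifier event. */
enum svm_event { SVM_EVENT_UNMAP, SVM_EVENT_EVICT };

/* Hypothetical action bits, one per step named in the cover letter. */
enum svm_action {
	ACT_CLEAR_PTES     = 1 << 0, /* clear GPU PTEs in notifier context */
	ACT_UNMAP_DMA      = 1 << 1, /* unmap DMA pages */
	ACT_MARK_UNMAPPED  = 1 << 2, /* mark ranges as unmapped */
	ACT_FLUSH_TLB      = 1 << 3, /* flush TLB */
	ACT_QUEUE_GC       = 1 << 4, /* enqueue to the GC worker */
	ACT_QUIESCE_QUEUES = 1 << 5, /* quiesce KFD queues */
	ACT_QUEUE_REBUILD  = 1 << 6, /* rebuild still valid portions */
	ACT_QUEUE_RESTORE  = 1 << 7, /* enqueue to the restore worker */
};

/* Return the set of actions for one invalidation, per the three cases. */
static unsigned int svm_invalidate_actions(enum svm_event ev, bool xnack_on)
{
	if (ev == SVM_EVENT_UNMAP) {
		unsigned int a = ACT_CLEAR_PTES | ACT_UNMAP_DMA |
				 ACT_MARK_UNMAPPED | ACT_FLUSH_TLB |
				 ACT_QUEUE_GC;

		/* XNACK off: the GPU cannot fault the valid parts back in. */
		if (!xnack_on)
			a |= ACT_QUIESCE_QUEUES | ACT_QUEUE_REBUILD;
		return a;
	}

	/* Evict, XNACK on: no worker, the next GPU fault remaps. */
	if (xnack_on)
		return ACT_CLEAR_PTES | ACT_UNMAP_DMA | ACT_FLUSH_TLB;

	/* Evict, XNACK off: quiesce first, then hand over to restore. */
	return ACT_QUIESCE_QUEUES | ACT_UNMAP_DMA | ACT_QUEUE_RESTORE;
}
```

Written this way, the asymmetry is easy to see: only the XNACK off paths touch the KFD queues or schedule work beyond the GC worker.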
> 
> Unsupported features:
>   - XNACK on GPU page fault mode
>   - Migration and prefetch
>   - Multi GPU support
> 
>   XNACK on enablement is ongoing. The GPUs that support XNACK on 
>   are currently only accessible to us via remote lab machines, which slows
>   down progress.
> 
> Patch overview:
> 
>   01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>         SET_ATTR/GET_ATTR operations, attribute types, and related
>         structs in amdgpu_drm.h.
> 
>   02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>         refcount, attr_tree, workqueues, locks, and
>         callbacks (begin/end_restore, flush_tlb).
> 
>   03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>         (interval tree node), attr_tree, access enum, flag masks,
>         and change trigger enum.
> 
>   04/12 Attribute tree operations: interval tree lookup, insert,
>         remove, and tree create/destroy lifecycle.
> 
>   05/12 Attribute set: validate UAPI attributes, apply to internal
>         attrs, handle hole/existing range with head/tail splitting,
>         compute change triggers, and -EAGAIN retry loop.
>         Implements attr_clear_pages for unmap cleanup and attr_get.
> 
>   06/12 Range data structures: amdgpu_svm_range extending
>         drm_gpusvm_range with gpu_mapped state, pending ops,
>         pte_flags cache, and GC/restore queue linkage.
> 
>   07/12 PTE flags and GPU mapping: simple GPU PTE function,
>         GPU page table update with DMA address, range mapping loop:
>         find_or_insert -> get_pages -> validate -> update PTE,
>         and attribute change driven mapping function.
> 
>   08/12 Notifier and invalidation: synchronous GPU PTE clear in
>         notifier context, range removal and overlap cleanup,
>         rebuild after destroy logic, and MMU event dispatcher.
> 
>   09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>         worker for unmap processing and rebuild, ordered restore
>         worker for mapping evicted ranges, and flush/sync
>         helpers.
> 
>   10/12 Initialization and fini: kmem_cache for range/attr,
>         drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>         flush helper, and amdgpu_svm init/close/fini lifecycle.
> 
>   11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>         protection, amdgpu_gem_svm_ioctl dispatcher, and
>         amdgpu_svm_handle_fault for GPU page fault recovery.
> 
>   12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>         Makefile rules, ioctl table registration, and amdgpu_vm
>         hooks (init in make_compute, close/fini, fault dispatch).
> 
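
The head/tail splitting mentioned for patch 05/12 can be illustrated with a small user-space sketch. This is an assumption-laden stand-in, not the series' code: the patches use an interval tree, whereas this sketch keeps a sorted array, and struct attr_range, struct attr_tree, attr_set() and attr_get() are hypothetical names with a single integer attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ATTRS 16

/* One attribute interval, inclusive [start, last]; names are hypothetical. */
struct attr_range { uint64_t start, last; int access; };

/* Sorted, non-overlapping array standing in for the interval tree. */
struct attr_tree { struct attr_range r[MAX_ATTRS]; size_t n; };

static int emit(struct attr_range *out, size_t *n, struct attr_range r)
{
	if (*n >= MAX_ATTRS)
		return -1;
	out[(*n)++] = r;
	return 0;
}

/* Set access on [start, last]; overlapped entries keep a head and/or tail. */
static int attr_set(struct attr_tree *t, uint64_t start, uint64_t last,
		    int access)
{
	struct attr_range out[MAX_ATTRS];
	struct attr_range new_r = { start, last, access };
	size_t n = 0, i;
	int placed = 0;

	assert(start <= last);

	for (i = 0; i < t->n; i++) {
		struct attr_range cur = t->r[i];

		if (cur.last < start) {		/* entirely before */
			if (emit(out, &n, cur))
				return -1;
			continue;
		}
		if (cur.start < start && cur.start <= last) {	/* head */
			struct attr_range head = { cur.start, start - 1,
						   cur.access };
			if (emit(out, &n, head))
				return -1;
		}
		if (!placed) {			/* new interval goes here */
			if (emit(out, &n, new_r))
				return -1;
			placed = 1;
		}
		if (cur.start > last) {		/* entirely after */
			if (emit(out, &n, cur))
				return -1;
		} else if (cur.last > last) {	/* tail */
			struct attr_range tail = { last + 1, cur.last,
						   cur.access };
			if (emit(out, &n, tail))
				return -1;
		}
	}
	if (!placed && emit(out, &n, new_r))
		return -1;

	for (i = 0; i < n; i++)
		t->r[i] = out[i];
	t->n = n;
	return 0;
}

/* Return the attribute covering addr, or -1 if addr falls in a hole. */
static int attr_get(const struct attr_tree *t, uint64_t addr)
{
	for (size_t i = 0; i < t->n; i++)
		if (addr >= t->r[i].start && addr <= t->r[i].last)
			return t->r[i].access;
	return -1;
}
```

Because this structure is independent of the mapped ranges, destroying a range on unmap leaves the userspace set attributes in place, which is the property the cover letter relies on.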
> Test result:
>   on gfx1100(W7900) and gfx943(MI300x)
>   kfd test: 95%+ passed, same failing cases as the official release
>   rocr test: all passed
>   hip catch test: 20 of 5366 cases failed, +13 failures vs. the official
>   release
> 
> During implementation we identified several challenges / design questions:
> 
> 1. No range splitting on partial unmap
> 
>   drm_gpusvm explicitly does not support range splitting (drm_gpusvm.c:122).
>   A partial munmap has to destroy the entire range, including the part
>   that is still valid. GPU fault driven hardware can handle this design
>   by taking an extra GPU fault, but AMDGPU needs to support XNACK off
>   hardware, so the driver has to rebuild the valid part of the removed
>   range. This brings a very heavy restore job to the work queue / GC
>   worker: unmap/destroy -> rebuild (insert and map). This restore work is
>   even heavier than in kfd_svm. In the previous driver the work queue
>   only needs to restore or unmap, but with drm_gpusvm the driver needs to
>   unmap and restore. This brings more complex logic, a heavier worker
>   queue workload, and synchronization issues.
> 
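
To make the rebuild step in challenge 1 concrete: once a partial munmap has destroyed a whole range, the part to rebuild is the range minus the unmapped interval, which is at most a head piece and a tail piece. A minimal user-space sketch, with struct span and svm_surviving_spans() as hypothetical names that do not appear in the series:

```c
#include <assert.h>
#include <stdint.h>

/* Half-open address interval [start, end); a hypothetical helper type. */
struct span { uint64_t start, end; };

/*
 * Given a destroyed range and the interval that was unmapped, store the
 * still valid pieces in out[] and return how many there are (0 to 2).
 */
static int svm_surviving_spans(struct span range, struct span unmapped,
			       struct span out[2])
{
	int n = 0;

	assert(range.start < range.end);
	assert(unmapped.start < unmapped.end);

	/* No overlap: the range was not affected, nothing to rebuild. */
	if (unmapped.end <= range.start || unmapped.start >= range.end)
		return 0;

	/* Head: the part of the range below the unmapped interval. */
	if (unmapped.start > range.start) {
		out[n].start = range.start;
		out[n].end = unmapped.start;
		n++;
	}

	/* Tail: the part of the range above the unmapped interval. */
	if (unmapped.end < range.end) {
		out[n].start = unmapped.end;
		out[n].end = range.end;
		n++;
	}

	return n;
}
```

For a 2M range with 512K unmapped in the middle this yields two pieces, each of which then has to go through insert and map again on XNACK off hardware.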
> 2. Fault driven vs ioctl driven mapping
> 
>   drm_gpusvm is designed around GPU page fault handlers. The primary entry
>   point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>   AMDGPU needs to support IOCTL driven mapping because on XNACK off
>   hardware the GPU cannot fault at all.
> 
>   The ioctl path cannot hold mmap_read_lock across the entire operation
>   because drm_gpusvm_range_find_or_insert() acquires/releases it
>   internally. This creates race windows with MMU notifiers / workers.
> 
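
The mismatch in challenge 2 shows up when an ioctl path has to cover a whole interval with an interface that takes a single address. The sketch below is a simplified user-space model under stated assumptions: pick_chunk() stands in for the framework's chunk selection by choosing the largest aligned chunk that contains the address and fits the bounds, the chunk sizes are illustrative, and no locking, HMM, or existing-range checks are modelled. None of these names come from drm_gpusvm or the series.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_DOWN_U64(x, a) ((x) & ~((uint64_t)(a) - 1))

/* Illustrative chunk sizes, descending powers of two (2M, 64K, 4K). */
static const uint64_t chunk_sizes[] = { 2u << 20, 64u << 10, 4u << 10 };

/* Half-open address interval [start, end); a hypothetical helper type. */
struct span { uint64_t start, end; };

/* Pick the largest aligned chunk containing addr that fits in [lo, hi). */
static struct span pick_chunk(uint64_t addr, uint64_t lo, uint64_t hi)
{
	struct span r = { addr, addr };	/* empty span means "no fit" */
	size_t i;

	for (i = 0; i < sizeof(chunk_sizes) / sizeof(chunk_sizes[0]); i++) {
		uint64_t s = ALIGN_DOWN_U64(addr, chunk_sizes[i]);

		if (s >= lo && s + chunk_sizes[i] <= hi) {
			r.start = s;
			r.end = s + chunk_sizes[i];
			break;
		}
	}
	return r;
}

/*
 * Cover [lo, hi) the way an ioctl path might: ask for a range at the
 * cursor as if it were a fault address, record it, then advance the
 * cursor to the end of that range. Returns the number of ranges.
 */
static size_t map_interval(uint64_t lo, uint64_t hi, struct span *out,
			   size_t max)
{
	uint64_t addr = lo;
	size_t n = 0;

	while (addr < hi && n < max) {
		struct span r = pick_chunk(addr, lo, hi);

		if (r.end <= addr)	/* nothing fits, stop */
			break;
		out[n++] = r;
		addr = r.end;
	}
	return n;
}
```

In the real path each iteration takes and drops mmap_read_lock inside the framework call, so the interval can change between two iterations; that is the race window described above.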
> 3. Multi GPU support
> 
> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
> each GPU gets an independent instance with its own range tree, MMU
> notifiers, notifier_lock, and DMA mappings.
> 
> This may bring huge overhead:
>     - N x MMU notifier registrations for the same address range
>     - N x hmm_range_fault() calls for the same page (KFD: 1x)
>     - N x DMA mapping memory
>     - N x invalidation + restore worker scheduling per CPU unmap event
>     - N x GPU page table flush / TLB invalidation
>     - Increased mmap_lock hold time, N callbacks serialize under it
> 
> Compatibility issues:
>     - Quiesce/resume scope mismatch: to integrate with KFD compute
>       queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
>       which have process level semantics. Under the per GPU
>       drm_gpusvm model, there may be synchronization issues. To properly
>       integrate with KFD under the per SVM model, a compatibility layer or
>       new per VM level queue control APIs may need to be introduced.
> 
> Migration challenges:
> 
>   - No global migration decision logic: each per GPU SVM
>     instance maintains its own attribute tree independently. This
>     allows conflicting settings (e.g., GPU0's SVM sets
>     PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>     for the same address range) with no detection or resolution.
>     A global attribute coordinator or a shared manager is needed to
>     provide a unified global view for migration decisions.
> 
>   - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>     notifier callbacks in all N-1 other drm_gpusvm instances,
>     causing N-1 unnecessary restore workers to be scheduled and
>     creating races between the initiating migration and the other
>     instances' restore attempts.
> 
>   - No cross instance migration serialization: each per GPU
>     drm_gpusvm instance has independent locking, so two GPUs'
>     "decide -> migrate -> remap" sequences can interleave. While
>     the kernel page lock prevents truly simultaneous migration of
>     the same physical page, the losing side's retry (evict from
>     other GPU's VRAM -> migrate back) triggers broadcast notifier
>     invalidations and restore workers, compounding the ping pong
>     problem above.
> 
>   - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>     hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>     it only selects system memory pages for migration.
> 
>   - CPU fault reverse migration race: CPU page fault triggers
>     migrate_to_ram while GPU instances are concurrently operating.
>     Per GPU notifier_lock does not protect cross GPU operations.
> 
> We believe a strong, well designed solution at the framework level is
> needed to properly address these problems, and we look forward to 
> discussion and suggestions.
> 
> Honglei Huang (12):
>   drm/amdgpu: add SVM UAPI definitions
>   drm/amdgpu: add SVM data structures and header
>   drm/amdgpu: add SVM attribute data structures
>   drm/amdgpu: implement SVM attribute tree operations
>   drm/amdgpu: implement SVM attribute set
>   drm/amdgpu: add SVM range data structures
>   drm/amdgpu: implement SVM range PTE flags and GPU mapping
>   drm/amdgpu: implement SVM range notifier and invalidation
>   drm/amdgpu: implement SVM range workers
>   drm/amdgpu: implement SVM core initialization and fini
>   drm/amdgpu: implement SVM ioctl and fault handler
>   drm/amdgpu: wire up SVM build system and fault handler
> 
>  drivers/gpu/drm/amd/amdgpu/Kconfig            |   11 +
>  drivers/gpu/drm/amd/amdgpu/Makefile           |   13 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |    2 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c       |  430 ++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h       |  147 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c  |  894 ++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h  |  110 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h |   76 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   40 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |    4 +
>  include/uapi/drm/amdgpu_drm.h                 |   39 +
>  12 files changed, 2958 insertions(+), 4 deletions(-)
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> 
> 
> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449

Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
