Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm

Honglei Huang Wed, 18 Mar 2026 01:59:51 -0700



On 3/17/26 19:48, Christian König wrote:

Adding a few XE and drm_gpuvm people on TO.

On 3/17/26 12:29, Honglei Huang wrote:

From: Honglei Huang <[email protected]>

This is a POC/draft patch series of SVM feature in amdgpu based on the
drm_gpusvm framework. The primary purpose of this RFC is to validate
the framework's applicability, identify implementation challenges,
and start discussion on framework evolution. This is not a production
ready submission.

This patch series implements basic SVM support with the following features:

   1) Attributes separated from physical page management:

     - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
       tree that stores SVM attributes. Managed through SET_ATTR
       and the MMU notifier callback.

     - Physical page layer (drm_gpusvm ranges): managed by the
       drm_gpusvm framework, representing actual HMM backed DMA
       mappings and GPU page table entries.

      This separation is necessary:
        -  The framework does not support range splitting, so a partial
           munmap destroys the entire overlapping range, including the
           still valid parts. If attributes were stored inside drm_gpusvm
           ranges, they would be lost on unmapping.
           The separate attr tree preserves userspace set attributes
           across range operations.


Isn't that actually intended? When parts of the range unmap then that usually 
means the whole range isn't valid any more.

It is about partial unmap: some subregion of a drm_gpusvm_range is still valid while another subregion is invalid, but under drm_gpusvm the entire range needs to be destroyed.


e.g.:

          [---------------unmap region in mmu notifier-----------------]
[0x1000 ------------ 0x9000]
[  valid ][     invalid    ]

See the details in drm_gpusvm.c around line 110,
section "Partial Unmapping of Ranges".


        -  drm_gpusvm range boundaries are determined by the fault address
           and a preset chunk size, not by userspace attribute boundaries.
           Ranges may be rechunked on memory changes. Embedding
           attributes in framework ranges would scatter attr state
           across many small ranges and require complex reassembly
           logic when operating on attributes.


Yeah, that makes a lot of sense.
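
As a rough illustration of why framework range boundaries do not follow
attribute boundaries, the sketch below picks a range around a fault address
from a preset list of chunk sizes. It is a simplified userspace model, not
the drm_gpusvm code: pick_chunk(), struct span and the example chunk sizes
are hypothetical, and the real framework applies further checks (for
example against notifier boundaries and already existing ranges).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open interval [start, end) in bytes. */
struct span {
	uint64_t start;
	uint64_t end;
};

/*
 * Pick the largest preset chunk that contains fault_addr, is naturally
 * aligned, and fits entirely inside the VMA. chunk_sizes must be sorted
 * in descending order and hold powers of two. Returns an empty span
 * (start == end) if no chunk fits.
 */
static struct span pick_chunk(uint64_t fault_addr, struct span vma,
			      const uint64_t *chunk_sizes, size_t n)
{
	struct span out = { 0, 0 };

	/* The fault must lie inside the VMA it was resolved to. */
	assert(fault_addr >= vma.start && fault_addr < vma.end);

	for (size_t i = 0; i < n; i++) {
		uint64_t size = chunk_sizes[i];
		uint64_t start = fault_addr & ~(size - 1);	/* align down */
		uint64_t end = start + size;

		if (start >= vma.start && end <= vma.end) {
			out.start = start;
			out.end = end;
			break;
		}
	}
	return out;
}
```

In this model a fault inside a VMA covering 0x1000..0x9000 can only be
served by a 4 KiB chunk, so one userspace attribute interval over that VMA
would end up spread across several small ranges.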


   2) System memory mapping via drm_gpusvm

      The core mapping path uses drm_gpusvm_range_find_or_insert() to
      create ranges, drm_gpusvm_range_get_pages() for HMM page fault
      and DMA mapping, then updates GPU page tables via
      amdgpu_vm_update_range().

   3) IOCTL driven mapping (XNACK off / no GPU fault mode)

      On XNACK off hardware the GPU cannot recover from page faults,
      so mappings must be established through ioctl. When
      userspace calls SET_ATTR with ACCESS=ENABLE, the driver
      walks the attr tree and maps all accessible intervals
      to the GPU via amdgpu_svm_range_map_attr_ranges().

   4) Invalidation, GC worker, and restore worker

      MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
      three cases based on event type and hardware mode:
        - unmap event: clear GPU PTEs in the notifier context,
          unmap DMA pages, mark ranges as unmapped, flush TLB,
          and enqueue to the GC worker. On XNACK off, also
          quiesce KFD queues and schedule rebuild of the
          still valid portions that were destroyed together with
          the unmapped subregion.

        - evict on XNACK off:
          quiesce KFD queues first, then unmap DMA pages and
          enqueue to the restore worker.


Is that done through the DMA fence or by talking directly to the MES/HWS?

Currently the KFD queue quiesce/resume APIs are reused; looking forward to a better solution.


Regards,
Honglei


Thanks,
Christian.


        - evict on XNACK on:
          clear GPU PTEs, unmap DMA pages, and flush TLB, but do
          not schedule any worker. The GPU will fault on next
          access and the fault handler establishes the mapping.
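
The three cases above can be summarised as a small decision table. The
sketch below is a userspace model of that dispatch only; the enum names,
the struct inval_plan fields and plan_invalidation() are hypothetical and
do not correspond to identifiers in the patches. It records only the
actions listed in the cover letter for each case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event classes seen by the MMU notifier callback. */
enum inval_event { INVAL_UNMAP, INVAL_EVICT };

/* Hypothetical summary of the actions taken for one invalidation. */
struct inval_plan {
	bool unmap_dma;		/* unmap DMA pages */
	bool clear_ptes;	/* clear GPU PTEs and flush TLB in the notifier */
	bool quiesce_queues;	/* quiesce KFD queues */
	bool queue_gc;		/* enqueue the range to the GC worker */
	bool queue_restore;	/* enqueue the range to the restore worker */
	bool rebuild_valid;	/* rebuild still valid parts of a destroyed range */
};

static struct inval_plan plan_invalidation(enum inval_event ev, bool xnack_on)
{
	/* Every case in the description unmaps the DMA pages. */
	struct inval_plan p = { .unmap_dma = true };

	assert(ev == INVAL_UNMAP || ev == INVAL_EVICT);

	if (ev == INVAL_UNMAP) {
		p.clear_ptes = true;
		p.queue_gc = true;
		if (!xnack_on) {
			/* The GPU cannot fault, so the driver must rebuild. */
			p.quiesce_queues = true;
			p.rebuild_valid = true;
		}
	} else if (xnack_on) {
		/* Evict, XNACK on: no worker, the next GPU fault remaps. */
		p.clear_ptes = true;
	} else {
		/* Evict, XNACK off: quiesce first, restore worker remaps. */
		p.quiesce_queues = true;
		p.queue_restore = true;
	}
	return p;
}
```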

Unsupported features:
   - XNACK on GPU page fault mode
   - Migration and prefetch
   - Multi GPU support

   XNACK on enablement is ongoing. The GPUs that support XNACK on
   are currently only accessible to us via remote lab machines, which slows
   down progress.

Patch overview:

   01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
         SET_ATTR/GET_ATTR operations, attribute types, and related
         structs in amdgpu_drm.h.

   02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
         refcount, attr_tree, workqueues, locks, and
         callbacks (begin/end_restore, flush_tlb).

   03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
         (interval tree node), attr_tree, access enum, flag masks,
         and change trigger enum.

   04/12 Attribute tree operations: interval tree lookup, insert,
         remove, and tree create/destroy lifecycle.

   05/12 Attribute set: validate UAPI attributes, apply to internal
         attrs, handle hole/existing range with head/tail splitting,
         compute change triggers, and -EAGAIN retry loop.
         Implements attr_clear_pages for unmap cleanup and attr_get.

   06/12 Range data structures: amdgpu_svm_range extending
         drm_gpusvm_range with gpu_mapped state, pending ops,
         pte_flags cache, and GC/restore queue linkage.

   07/12 PTE flags and GPU mapping: simple gpu pte function,
         GPU page table update with DMA address, range mapping loop:
         find_or_insert -> get_pages -> validate -> update PTE,
         and attribute change driven mapping function.

   08/12 Notifier and invalidation: synchronous GPU PTE clear in
         notifier context, range removal and overlap cleanup,
         rebuild after destroy logic, and MMU event dispatcher.

   09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
         worker for unmap processing and rebuild, ordered restore
         worker for mapping evicted ranges, and flush/sync
         helpers.

   10/12 Initialization and fini: kmem_cache for range/attr,
         drm_gpusvm_init with chunk sizes, XNACK detection, TLB
         flush helper, and amdgpu_svm init/close/fini lifecycle.

   11/12 IOCTL and fault handler: PASID based SVM lookup with kref
         protection, amdgpu_gem_svm_ioctl dispatcher, and
         amdgpu_svm_handle_fault for GPU page fault recovery.

   12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
         Makefile rules, ioctl table registration, and amdgpu_vm
         hooks (init in make_compute, close/fini, fault dispatch).

Test result:
   on gfx1100(W7900) and gfx943(MI300x)
   kfd test: 95%+ passed, same failed cases as the official release
   rocr test: all passed
   hip catch test: 20 of 5366 cases failed, +13 failures vs the official
   release

During implementation we identified several challenges / design questions:

1. No range splitting on partial unmap

   drm_gpusvm explicitly does not support range splitting (see
   drm_gpusvm.c:122). A partial munmap has to destroy the entire range,
   including the still valid interval. GPU fault driven hardware can cope
   with this design by handling an extra GPU fault, but AMDGPU also needs
   to support XNACK off hardware, where the driver has to rebuild the valid
   part of the removed range. This brings very heavy restore work to the
   work queue / GC worker: unmap/destroy -> rebuild (insert and map).
   This restore work is even heavier than in kfd_svm. There the work queue
   only needs to either restore or unmap, but with drm_gpusvm the driver
   needs to unmap and then restore, which brings more complex logic, a
   heavier worker queue workload, and synchronization issues.
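
   To make the rebuild cost concrete, the following userspace sketch
   computes which parts of a range are still valid after a partial unmap
   and would therefore have to be re-inserted and re-mapped on XNACK off
   hardware. struct interval and surviving_parts() are hypothetical helpers
   for illustration, not functions from drm_gpusvm or from these patches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open interval [start, end) in bytes. */
struct interval {
	uint64_t start;
	uint64_t end;
};

/*
 * Given an existing range and the unmapped region reported by the
 * notifier, write the parts of the range that are still valid into
 * out[] and return how many there are (0, 1 or 2).
 */
static int surviving_parts(struct interval range, struct interval unmap,
			   struct interval out[2])
{
	int n = 0;

	assert(range.start < range.end && unmap.start < unmap.end);

	/* No overlap: the range is untouched and stays valid as a whole. */
	if (unmap.end <= range.start || unmap.start >= range.end) {
		out[n++] = range;
		return n;
	}
	if (unmap.start > range.start)		/* still valid head */
		out[n++] = (struct interval){ range.start, unmap.start };
	if (unmap.end < range.end)		/* still valid tail */
		out[n++] = (struct interval){ unmap.end, range.end };
	return n;
}
```

   Every interval returned for an overlapping unmap is work the GC worker
   has to redo after the whole range was destroyed.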

2. Fault driven vs ioctl driven mapping

   drm_gpusvm is designed around GPU page fault handlers. The primary entry
   point drm_gpusvm_range_find_or_insert() takes a fault_addr.
   AMDGPU needs to support IOCTL driven mapping because on XNACK off
   hardware the GPU cannot fault at all.

   The ioctl path cannot hold mmap_read_lock across the entire operation
   because drm_gpusvm_range_find_or_insert() acquires/releases it
   internally. This creates race windows with MMU notifiers / workers.

3. Multi GPU support

drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
each GPU gets an independent instance with its own range tree, MMU
notifiers, notifier_lock, and DMA mappings.

This may bring huge overhead:
     - N x MMU notifier registrations for the same address range
     - N x hmm_range_fault() calls for the same page (KFD: 1x)
     - N x DMA mapping memory
     - N x invalidation + restore worker scheduling per CPU unmap event
     - N x GPU page table flush / TLB invalidation
     - Increased mmap_lock hold time, N callbacks serialize under it

Compatibility issues:
     - Quiesce/resume scope mismatch: to integrate with KFD compute
       queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
       which have process level semantics. Under the per GPU
       drm_gpusvm model, there may be synchronization issues. To properly
       integrate with KFD under the per SVM model, compatibility APIs or
       new per VM level queue control APIs may need to be introduced.

Migration challenges:

   - No global migration decision logic: each per GPU SVM
     instance maintains its own attribute tree independently. This
     allows conflicting settings (e.g., GPU0's SVM sets
     PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
     for the same address range) with no detection or resolution.
     A global attribute coordinator or a shared manager is needed to
     provide a unified global view for migration decisions.

   - migrate_vma_setup broadcast: one GPU's migration triggers MMU
     notifier callbacks in ALL N-1 other drm_gpusvm instances,
     causing N-1 unnecessary restore workers to be scheduled. It also
     creates races between the initiating migration and the other
     instances' restore attempts.

   - No cross instance migration serialization: each per GPU
     drm_gpusvm instance has independent locking, so two GPUs'
     "decide -> migrate -> remap" sequences can interleave. While
     the kernel page lock prevents truly simultaneous migration of
     the same physical page, the losing side's retry (evict from
     other GPU's VRAM -> migrate back) triggers broadcast notifier
     invalidations and restore workers, compounding the ping pong
     problem above.

   - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
     hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
     it only selects system memory pages for migration.

   - CPU fault reverse migration race: CPU page fault triggers
     migrate_to_ram while GPU instances are concurrently operating.
     Per GPU notifier_lock does not protect cross GPU operations.

We believe a strong, well designed solution at the framework level is
needed to properly address these problems, and we look forward to
discussion and suggestions.

Honglei Huang (12):
   drm/amdgpu: add SVM UAPI definitions
   drm/amdgpu: add SVM data structures and header
   drm/amdgpu: add SVM attribute data structures
   drm/amdgpu: implement SVM attribute tree operations
   drm/amdgpu: implement SVM attribute set
   drm/amdgpu: add SVM range data structures
   drm/amdgpu: implement SVM range PTE flags and GPU mapping
   drm/amdgpu: implement SVM range notifier and invalidation
   drm/amdgpu: implement SVM range workers
   drm/amdgpu: implement SVM core initialization and fini
   drm/amdgpu: implement SVM ioctl and fault handler
   drm/amdgpu: wire up SVM build system and fault handler

  drivers/gpu/drm/amd/amdgpu/Kconfig            |   11 +
  drivers/gpu/drm/amd/amdgpu/Makefile           |   13 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |    2 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c       |  430 ++++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h       |  147 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c  |  894 ++++++++++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h  |  110 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h |   76 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   40 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |    4 +
  include/uapi/drm/amdgpu_drm.h                 |   39 +
  12 files changed, 2958 insertions(+), 4 deletions(-)
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h


base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
