On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
> 

Disclaimer: I haven't looked at any code in this series yet.

> 
> On 3/17/26 19:48, Christian König wrote:
> > Adding a few XE and drm_gpuvm people on TO.
> > 
> > On 3/17/26 12:29, Honglei Huang wrote:
> > > From: Honglei Huang <[email protected]>
> > > 
> > > This is a POC/draft patch series of SVM feature in amdgpu based on the
> > > drm_gpusvm framework. The primary purpose of this RFC is to validate
> > > the framework's applicability, identify implementation challenges,
> > > and start discussion on framework evolution. This is not a production

+1. Open to any ideas. Given this was originally designed for Xe, we very
well could have missed other drivers' requirements.

> > > ready submission.
> > > 
> > > This patch series implements basic SVM support with the following 
> > > features:
> > > 
> > >    1) Attributes separated from physical page management:
> > > 
> > >      - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
> > >        tree that stores SVM attributes. Managed through the SET_ATTR,
> > >        and mmu notifier callback.

Can you explain the mmu notifier callback interaction here? See below;
in Xe, the attribute tree is the existing VMA tree (gpuvm).

> > > 
> > >      - Physical page layer (drm_gpusvm ranges): managed by the
> > >        drm_gpusvm framework, representing actual HMM backed DMA
> > >        mappings and GPU page table entries.
> > > 
> > >       This separation is necessary:
> > >         -  The framework does not support range splitting, so a partial
> > >            munmap destroys the entire overlapping range, including the
> > >            still valid parts. If attributes were stored inside drm_gpusvm
> > >            ranges, they would be lost on unmapping.
> > >            The separate attr tree preserves userspace-set attributes
> > >            across range operations.

Yes, in Xe the divide is at the VMA level (set by user space) via VM
bind (parts of a VM may map BOs, parts could be set up for SVM) or
madvise IOCTLs, which reflect user-space attributes on current SVM
mappings or future ones.

The SVM range tree reflects mappings that have been faulted into the
device and contain pages. This is an intentional choice.

> > 
> > Isn't that actually intended? When parts of the range are unmapped, that
> > usually means the whole range isn't valid any more.


Yes, this was an intentional design choice to not support partial unmap,
and instead rely on the driver to recreate a new range.

The reasoning is:

- In practice, this should be rare for well-behaved applications.

- With THP / large device pages, if a sub-range is unmapped, the entire
GPU mapping is invalidated anyway due to the page size change. As a
result, the cost of creating a new range is minimal, since the device
will likely fault again on the remaining pages.

So there is no need to over-engineer the common code.

FWIW, to even test partial unmaps in Xe, I had to do things I doubt
anyone would ever do:

ptr = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
/* fault in memory to the device */
munmap(ptr, SZ_1M);
/* touch memory again on the device */

> 
> 
> It is about partial unmap: some subregion in a drm_gpusvm_range is still
> valid while another subregion is invalid, but under drm_gpusvm the entire
> range needs to be destroyed.
> 
> e.g.:
> 
>           [---------------unmap region in mmu notifier-----------------]
> [0x1000 ------------ 0x9000]
> [  valid ][     invalid    ]
> 
> see details in drm_gpusvm.c:110, section "Partial Unmapping of Ranges"
> 
> 
> > 
> > > 
> > >         -  drm_gpusvm range boundaries are determined by fault address
> > >            and preset chunk size, not by userspace attribute boundaries.
> > >            Ranges may be rechunked on memory changes. Embedding
> > >            attributes in framework ranges would scatter attr state
> > >            across many small ranges and require complex reassembly
> > >            logic when operating on attributes.
> > 
> > Yeah, that makes a lot of sense.
> > 
> > > 
> > >    2) System memory mapping via drm_gpusvm
> > > 
> > >       The core mapping path uses drm_gpusvm_range_find_or_insert() to
> > >       create ranges, drm_gpusvm_range_get_pages() for HMM page fault
> > >       and DMA mapping, then updates GPU page tables via
> > >       amdgpu_vm_update_range().
> > > 
> > >    3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> > > 
> > >       On XNACK off hardware the GPU cannot recover from page faults,
> > >       so mappings must be established through ioctl. When
> > >       userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> > >       walks the attr tree and maps all accessible intervals
> > >       to the GPU by amdgpu_svm_range_map_attr_ranges().

Can you expand on XNACK off / no GPU faults? Is this to share the GPU
between 3D (dma-fence) and faulting clients? We have something similar
in Xe, but it isn't an explicit IOCTL; rather, we switch between the two
on demand as a 3D client submits, then resume page faults when all
dma-fences have signaled.

I see below you mention page tables are modified while KFD queues are
quiesced? I'm not sure that is required; you just need to guarantee that
faulting clients won't trigger page faults while a dma-fence is in
flight.

Maybe give me an explanation of exactly what the requirements from AMD
are here so I have a better picture.

> > > 
> > >    4) Invalidation, GC worker, and restore worker
> > > 
> > >       MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
> > >       three cases based on event type and hardware mode:
> > >         - unmap event: clear GPU PTEs in the notifier context,
> > >           unmap DMA pages, mark ranges as unmapped, flush TLB,
> > >           and enqueue to the GC worker. On XNACK off, also
> > >           quiesce KFD queues and schedule rebuild of the
> > >           still valid portions that were destroyed together with
> > >           the unmapped subregion.
> > > 
> > >         - evict on XNACK off:
> > >           quiesce KFD queues first, then unmap DMA pages and
> > >           enqueue to the restore worker.
> > 
> > Is that done through the DMA fence or by talking directly to the MES/HWS?
> 
> Currently the KFD queue quiesce/resume APIs are reused, looking forward
> to a better solution.
> 

+1

> Regards,
> Honglei
> 
> > 
> > Thanks,
> > Christian.
> > 
> > > 
> > >         - evict on XNACK on:
> > >           clear GPU PTEs, unmap DMA pages, and flush TLB, but do
> > >           not schedule any worker. The GPU will fault on next
> > >           access and the fault handler establishes the mapping.
> > > 
> > > Not supported feature:
> > >    - XNACK on GPU page fault mode
> > >    - migration and prefetch feature
> > >    - Multi GPU support
> > > 
> > >    XNACK on enablement is ongoing. The GPUs that support XNACK on
> > >    are currently only accessible to us via remote lab machines, which
> > >    slows down progress.
> > > 
> > > Patch overview:
> > > 
> > >    01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> > >          SET_ATTR/GET_ATTR operations, attribute types, and related
> > >          structs in amdgpu_drm.h.
> > > 
> > >    02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
> > >          refcount, attr_tree, workqueues, locks, and
> > >          callbacks (begin/end_restore, flush_tlb).
> > > 
> > >    03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
> > >          (interval tree node), attr_tree, access enum, flag masks,
> > >          and change trigger enum.
> > > 
> > >    04/12 Attribute tree operations: interval tree lookup, insert,
> > >          remove, and tree create/destroy lifecycle.
> > > 
> > >    05/12 Attribute set: validate UAPI attributes, apply to internal
> > >          attrs, handle hole/existing range with head/tail splitting,
> > >          compute change triggers, and -EAGAIN retry loop.
> > >          Implements attr_clear_pages for unmap cleanup and attr_get.
> > > 
> > >    06/12 Range data structures: amdgpu_svm_range extending
> > >          drm_gpusvm_range with gpu_mapped state, pending ops,
> > >          pte_flags cache, and GC/restore queue linkage.
> > > 
> > >    07/12 PTE flags and GPU mapping: simple gpu pte function,
> > >          GPU page table update with DMA address, range mapping loop:
> > >          find_or_insert -> get_pages -> validate -> update PTE,
> > >          and attribute change driven mapping function.
> > > 
> > >    08/12 Notifier and invalidation: synchronous GPU PTE clear in
> > >          notifier context, range removal and overlap cleanup,
> > >          rebuild after destroy logic, and MMU event dispatcher
> > > 
> > >    09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> > >          worker for unmap processing and rebuild, ordered restore
> > >          worker for mapping evicted ranges, and flush/sync
> > >          helpers.
> > > 
> > >    10/12 Initialization and fini: kmem_cache for range/attr,
> > >          drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> > >          flush helper, and amdgpu_svm init/close/fini lifecycle.
> > > 
> > >    11/12 IOCTL and fault handler: PASID based SVM lookup with kref
> > >          protection, amdgpu_gem_svm_ioctl dispatcher, and
> > >          amdgpu_svm_handle_fault for GPU page fault recovery.
> > > 
> > >    12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
> > >          Makefile rules, ioctl table registration, and amdgpu_vm
> > >          hooks (init in make_compute, close/fini, fault dispatch).
> > > 
> > > Test result:
> > >    on gfx1100(W7900) and gfx943(MI300x)
> > >    kfd test: 95%+ passed, same failed cases as the official release
> > >    rocr test: all passed
> > >    hip catch test: 20 cases failed out of 5366 total, +13 failures vs
> > >    the official release
> > > 
> > > During implementation we identified several challenges / design questions:
> > > 
> > > 1. No range splitting on partial unmap
> > > 
> > >    drm_gpusvm explicitly does not support range splitting
> > >    (drm_gpusvm.c:122). A partial munmap needs to destroy the entire
> > >    range, including the valid interval. GPU-fault-driven hardware can
> > >    handle this design with extra GPU fault handling, but AMDGPU needs
> > >    to support XNACK off hardware, and this design requires the driver
> > >    to rebuild the valid part of the removed range. This brings a very
> > >    heavy restore workload in the work queue / GC worker:
> > >    unmap/destroy -> rebuild (insert and map). This restore work is
> > >    even heavier than in kfd_svm: previously the driver work queue
> > >    only needed to restore or unmap, but with drm_gpusvm the driver
> > >    needs to unmap and restore, which brings more complex logic, a
> > >    heavier work queue load, and synchronization issues.

Is this common in the workloads you are running? I'm also wondering if
your restore logic / KFD's design is actually contributing to the
problem here.

> > > 
> > > 2. Fault driven vs ioctl driven mapping
> > > 
> > >    drm_gpusvm is designed around GPU page fault handlers. The primary 
> > > entry
> > >    point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> > >    AMDGPU needs to support IOCTL-driven mapping because on no-XNACK
> > >    hardware the GPU cannot fault at all.

I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
issues these so the device does not fault (e.g., prefetch creates a set
of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
specific VM bind operations.

> > > 
> > >    The ioctl path cannot hold mmap_read_lock across the entire operation
> > >    because drm_gpusvm_range_find_or_insert() acquires/releases it
> > >    internally. This creates race windows with MMU notifiers / workers.

This is a very intentional choice in the locking design: mmap_read_lock
is held only in very specific parts of GPU SVM, and the driver should
never need to take this lock.

Yes, notifiers can race, which is why the GPU fault handler and prefetch
handler are structured as retry loops when a notifier race is detected.
In practice, with well-behaved applications, these races should be
rare, but they do occur, and the driver must handle them.

__xe_svm_handle_pagefault implements the page fault retry loop. VM bind
prefetch has similar logic, although it is more spread out given that it
is part of a deeper software pipeline.

FWIW, holding locks to avoid races was rejected by Sima because we
reasoned it is essentially impossible to guarantee the absence of races
by holding a lock. CPU page fault handlers are also effectively just
large retry loops.

So this is one point I believe you will need to fix up on the driver side.

> > > 
> > > 3. Multi GPU support
> > > 
> > > drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
> > > each GPU gets an independent instance with its own range tree, MMU
> > > notifiers, notifier_lock, and DMA mappings.
> > > 

This is a part I am absolutely open to fixing. Right now, each
drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
decoupling a GPU SVM instance from a single device, allowing each
drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
device).

This would give drivers the flexibility to use one GPU SVM instance per
VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
MM.

> > > This may bring huge overhead:
> > >      - N x MMU notifier registrations for the same address range

The notifier overhead is a real concern. We recently introduced two-pass
notifiers [1] to speed up multi-device notifiers. At least in Xe, the
TLB invalidations, which are the truly expensive part, can be pipelined
using the two-pass approach. Currently, [1] only implements two-pass
notifiers for userptr, but Xe's GPU SVM will be updated to use them
shortly.

[1] https://patchwork.freedesktop.org/series/153280/

> > >      - N x hmm_range_fault() calls for the same page (KFD: 1x)

hmm_range_fault is extremely fast compared to the actual migration.
Running hmm_range_fault on a 2MB region using 4KB pages takes less
than 1µs. With THP or large device pages [2] (merged last week), it’s
around 1/20 of a microsecond. So I wouldn’t be too concerned about this.

[2] https://patchwork.freedesktop.org/series/163141/

> > >      - N x DMA mapping memory

You will always have N x DMA-mapping memory if the pages are in system
memory, since the DMA mapping API is per device.

> > >      - N x invalidation + restore worker scheduling per CPU unmap event
> > >      - N x GPU page table flush / TLB invalidation

I agree you do not want to serialize GPU page table flushes / TLB
invalidations. Hence two-pass notifiers [1].

> > >      - Increased mmap_lock hold time, N callbacks serialize under it
> > > 
> > > compatibility issues:
> > >      - Quiesce/resume scope mismatch: to integrate with KFD compute
> > >        queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm()
> > >        which have process-level semantics. Under the per-GPU
> > >        drm_gpusvm model, there may be synchronization issues. To
> > >        properly integrate with KFD under the per-SVM model, a
> > >        compatibility layer or new per-VM queue control APIs may
> > >        need to be introduced.
> > > 

I thought the idea was to get rid of KFD and move over to AMDGPU? I
thought Christian mentioned this to me at XDC.

> > > Migration challenges:
> > > 
> > >    - No global migration decision logic: each per GPU SVM
> > >      instance maintains its own attribute tree independently. This
> > >      allows conflicting settings (e.g., GPU0's SVM sets
> > >      PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
> > >      for the same address range) with no detection or resolution.
> > >      A global attribute coordinator or a shared manager is needed to
> > >      provide a unified global view for migration decisions.

Yes, this is a hole in the Xe API too. We have told UMDs that if they
set up individual VMs with conflicting attributes for a single CPU
address space, the behavior is undefined. Our UMD's madvise
implementation basically loops over all GPU VMs, setting the same
attributes.

> > > 
> > >    - migrate_vma_setup broadcast: one GPU's migration triggers MMU
> > >      notifier callbacks in ALL N-1 other drm_gpusvm instances,
> > >      causing N-1 unnecessary restore workers to be scheduled. And

My feeling is that you shouldn’t reschedule restore workers unless you
actually have to invalidate page tables (i.e., you have a local SVM
range within the notifier). So the first migration to an untouched
region may trigger notifiers, but they won’t do anything because you
don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
region won’t trigger a notifier unless the memory is moved again.

> > >      creates races between the initiating migration and the other
> > >      instance's restore attempts.

Yes, if multiple devices try to migrate the same CPU pages at the same
time, that will race. That’s why in Xe we have a module-level
driver_migrate_lock. The first migration runs in read mode; if it
detects a race and aborts, it then takes driver_migrate_lock in write
mode so it becomes the only device allowed to move memory / CPU pages.
See xe_svm_alloc_vram() for how this is used.

I’m not sure this approach will work for you, but I just wanted to point
out that we identified this as a potential issue.

> > > 
> > >    - No cross instance migration serialization: each per GPU
> > >      drm_gpusvm instance has independent locking, so two GPUs'
> > >      "decide -> migrate -> remap" sequences can interleave. While
> > >      the kernel page lock prevents truly simultaneous migration of
> > >      the same physical page, the losing side's retry (evict from
> > >      other GPU's VRAM -> migrate back) triggers broadcast notifier
> > >      invalidations and restore workers, compounding the ping pong
> > >      problem above.
> > > 

See the driver_migrate_lock above.

> > >    - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> > >      hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
> > >      it only selects system memory pages for migration.
> > > 

I think this is fixed? We did find some core MM bugs that blocked VRAM
to VRAM, but those have been worked out.

The code I'm looking at:

int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
				  struct mm_struct *mm,
				  unsigned long start, unsigned long end,
				  const struct drm_pagemap_migrate_details *mdetails)
{
	const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
	struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
	struct dev_pagemap *pagemap = dpagemap->pagemap;
	struct migrate_vma migrate = {
		.start		= start,
		.end		= end,
		.pgmap_owner	= pagemap->owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_DEVICE_COHERENT |
				  MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};

> > >    - CPU fault reverse migration race: CPU page fault triggers
> > >      migrate_to_ram while GPU instances are concurrently operating.
> > >      Per GPU notifier_lock does not protect cross GPU operations.

No; again, this is handled by the retry loop discussed above.

> > > 
> > > We believe a strong, well designed solution at the framework level is
> > > needed to properly address these problems, and we look forward to
> > > discussion and suggestions.

Let's work together to figure out what is missing here.

Matt

> > > 
> > > Honglei Huang (12):
> > >    drm/amdgpu: add SVM UAPI definitions
> > >    drm/amdgpu: add SVM data structures and header
> > >    drm/amdgpu: add SVM attribute data structures
> > >    drm/amdgpu: implement SVM attribute tree operations
> > >    drm/amdgpu: implement SVM attribute set
> > >    drm/amdgpu: add SVM range data structures
> > >    drm/amdgpu: implement SVM range PTE flags and GPU mapping
> > >    drm/amdgpu: implement SVM range notifier and invalidation
> > >    drm/amdgpu: implement SVM range workers
> > >    drm/amdgpu: implement SVM core initialization and fini
> > >    drm/amdgpu: implement SVM ioctl and fault handler
> > >    drm/amdgpu: wire up SVM build system and fault handler
> > > 
> > >   drivers/gpu/drm/amd/amdgpu/Kconfig            |   11 +
> > >   drivers/gpu/drm/amd/amdgpu/Makefile           |   13 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |    2 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c       |  430 ++++++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h       |  147 ++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c  |  894 ++++++++++++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h  |  110 ++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h |   76 ++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   40 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |    4 +
> > >   include/uapi/drm/amdgpu_drm.h                 |   39 +
> > >   12 files changed, 2958 insertions(+), 4 deletions(-)
> > >   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> > >   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> > >   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> > >   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> > >   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> > >   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> > > 
> > > 
> > > base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
> > 
> 
