gpusvm: split MM and device state across

Matthew Brost Tue, 09 Jun 2026 20:44:22 -0700

On Wed, Jun 03, 2026 at 02:56:15PM +0800, Honglei Huang wrote:
> From: Honglei Huang <[email protected]>
> 
> The intent of this series is to make drm_gpusvm more flexible and
> give drivers more freedom over how they assemble the MM related and device
> side operations.
> 
> This RFC implements the direction Matt suggested in [1]:
> 
>   - Move struct drm_gpusvm_pages out of struct drm_gpusvm_range.
>   - Embed either a struct device or a struct drm_device in struct
>     drm_gpusvm_pages.
>   - Drop struct drm_device from struct drm_gpusvm.
>   - Have the driver's range structure embed one or more struct
>     drm_gpusvm_pages in addition to struct drm_gpusvm_range.
>   - Refactor a few range-based helpers (drm_gpusvm_range_pages_valid,
>     drm_gpusvm_range_get_pages, drm_gpusvm_range_unmap_pages), or
>     simply drop them entirely and update drivers to use the
>     drm_gpusvm_pages helpers instead.
>


Overall this looks good - thanks doing this.

> In essence the series does only two abstractions, plus the xe
> adaptation that follows from them:
> 
>   - range vs pages: split drm_gpusvm_range (MM / VA range state) from
>     drm_gpusvm_pages (device physical related), so the two
>     sides can have independent lifetimes and ownership.
>   - drm_gpusvm vs drm_device: make drm_gpusvm pure MM level and push
>     the device side down onto drm_gpusvm_pages, which is where DMA 
>     actually happens.
>   - xe is updated to fit the modifications, no functional change intended.
> 
> If such changes are acceptable in terms of direction, I have a few questions:
> 
>  - Drivers now own drm_gpusvm_pages unmap / free and notifier_seq init. 
>    OK to push this fully to drivers, or should some new mechanisms need to 
> add 
>    to ensure functions can be completed by the framework?

I'm looking at the diff of xe_svm.c before / after and I see
drm_gpusvm_free_pages moved to xe_svm_range_free. That looks fine to me.

I see in xe_svm_range_alloc() this:

range->pages.notifier_seq = LONG_MAX;

Can we make help like drm_gpusvm_init_pages which does this? I think it
is better to encapsulate the pages init into normalized helper even
though it is very simple. Maybe an inline since this just a single line
of code?

>  - This series drops the three drm_gpusvm_range_* helpers and changes
>    drm_gpusvm_get_pages() / drm_gpusvm_init() signatures.
>    Do we need to keep thin wrappers for backward compatibility.

It should be safe to drop these helpers.

>  - drm_gpusvm_get_pages() mixes HMM fault and device DMA map. Multi device 
> under 
>    one SVM calls would repeat the HMM fault. Does it need to modified to Split
>    into MM level fault + per pages DMA map?
>

Hmm, this might get a little tricky because of how the allocation/retry
loop is implemented in get_pages(). Maybe we could change the function
to accept an array of pages plus a count? I’m not sure what the best
approach is here, but I’m open to ideas. That said, I’d rather avoid
having the driver open-code a retry loop if it could live in common
code.

Side note: another modification we need in get_pages() is to make the
DMA-mapping step optional. I suggested that AMDXDNA use GPU SVM for
userptr, and I don’t believe that device requires DMA mapping.
 
> Patch overview:
> 
>   1/5 gpusvm: split MM state flags onto drm_gpusvm_range_flags.
>   2/5 gpusvm: embed drm_device into drm_gpusvm_pages; DMA goes
>       through it.
>   3/5 xe: xe_svm_range owns its drm_gpusvm_pages and its lifecycle.
>   4/5 gpusvm: drop pages from drm_gpusvm_range and the range-level
>       wrappers.
>   5/5 gpusvm: drop drm_device from drm_gpusvm.
> 
> tests:
> AMDGPU:
>   based on amdgpu adaptation patch in [2], but still SVM:DRM = 1:1,
>   1:n is on going needs many modifications and testings.
> 
>   Tested on gfx943 (MI300X) and gfx906 (MI60) with XNACK on/off:
>   - KFD test: 95%+ passed.
>   - ROCR test: all passed.
>   - HIP catch test: gfx943 (MI300X): 96% passed.
>                     gfx906 (MI60): 99% passed.
> INTEL XE:
>   TODO: We bought some Intel Arc A380, but it seems like this cards 
>   don't support hardware fault / SVM, waiting for the new 
>   cards B580/B570 to arrive.
>

Please send patches that modify GPU SVM or Xe to the Xe mailing list. We
have public CI, which I believe can be triggered by any AMD email
address.

I just pulled the code, encountered a compile error, and noticed a bug
around unmapping related to that error. I put together some quick fixes
on top of the series here [3], and locally all of our tests seem to be
passing.

I’ll reply in detail to the patches shortly, explaining some of the
reasoning behind these changes.

Matt

[3] 
https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/623f6a50c037d9e44f6c9fbe6859a0ba7ad50177
 
> links:
> [1] 
> https://lore.kernel.org/amd-gfx/acRgr7QwdULsn6G2@gsse-cloud1/#:~:text=I%20think%20roughly,drm_gpusvm_pages%0A%20%20helpers%20instead.
> [2] 
> https://lore.kernel.org/amd-gfx/[email protected]/
> Honglei Huang (5):
>   drm/gpusvm: split MM state flags out of drm_gpusvm_pages_flags
>   drm/gpusvm: embed struct drm_device into drm_gpusvm_pages
>   drm/xe: have xe_svm_range embed one drm_gpusvm_pages
>   drm/gpusvm: move struct drm_gpusvm_pages out of struct
>     drm_gpusvm_range
>   drm/gpusvm: let the drm_gpusvm core context purely MM level
> 
>  drivers/gpu/drm/drm_gpusvm.c    | 128 +++++++++-----------------------
>  drivers/gpu/drm/xe/xe_pt.c      |   2 +-
>  drivers/gpu/drm/xe/xe_svm.c     |  37 +++++----
>  drivers/gpu/drm/xe/xe_svm.h     |  11 ++-
>  drivers/gpu/drm/xe/xe_userptr.c |   1 +
>  include/drm/drm_gpusvm.h        |  49 ++++++------
>  6 files changed, 95 insertions(+), 133 deletions(-)
> 
> -- 
> 2.34.1
>

Re: [RFC 0/5] drm/gpusvm: split MM and device state across

Reply via email to