This series adds VRAM migration support to amdgpu's SVM (Shared Virtual
Memory) implementation, using the drm_pagemap framework for ZONE_DEVICE
page management and SDMA for data migration.

v5 unifies the XNACK-on (GPU fault-driven) and XNACK-off (ioctl-driven /
restore-based) migration paths. It is built on top of the drm_gpusvm-based
amdgpu SVM core [1] and the xnack-off restore infrastructure [2]. v1, v2
and v3 were XNACK-off only and v4 was XNACK-on only; v5 merges both paths
into a single series.

The implementation follows the Xe driver's approach to TTM eviction: a
synchronous bo_move migrates device-private pages back to system RAM when
TTM needs to evict SVM BOs.

Key design points:

- GPU VRAM registered as ZONE_DEVICE via devm_memremap_pages(),
  wrapped in struct amdgpu_pagemap with drm_pagemap state
- SDMA-based data transfer through GART aperture window for both
  copy_to_devmem and copy_to_ram callbacks
- amdgpu_bo_svm: lightweight BO subtype with drm_pagemap_devmem for
  ZONE_DEVICE page ownership tracking
- Synchronous TTM eviction via drm_pagemap_evict_to_ram() in bo_move,
  following the Xe pattern (no eviction fences needed)
- Migration policy driven by SVM range attributes (preferred location,
  prefetch hints) for both XNACK-on fault path and XNACK-off restore path
- XNACK-off integration: migration triggered in restore worker and
  attr-change boundary realign paths
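
To illustrate the "transfer through GART aperture window" point, here is a
rough userspace model of a batched copy. It is not code from the series:
the window size, the sketch_* names and the use of memcpy() in place of an
SDMA job are all assumptions made for the example. It only shows the shape
of the loop, where scattered system pages are mapped into a fixed number of
window slots and copied to a contiguous destination one batch at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE    4096u
#define SKETCH_WINDOW_PAGES 4u   /* hypothetical window size, in pages */

/*
 * Model of a copy_to_devmem-style transfer: scattered system pages are
 * "mapped" into a fixed-size window, then copied to contiguous VRAM.
 * memcpy() stands in for the SDMA copy job. Returns the number of
 * window-sized batches that were needed.
 */
static size_t sketch_copy_to_devmem(unsigned char *vram,
				    unsigned char *const *sys_pages,
				    size_t npages)
{
	const unsigned char *window[SKETCH_WINDOW_PAGES];
	size_t done = 0, batches = 0;

	while (done < npages) {
		size_t chunk = npages - done;

		if (chunk > SKETCH_WINDOW_PAGES)
			chunk = SKETCH_WINDOW_PAGES;
		assert(chunk > 0);

		/* Map this batch of scattered pages into the window slots. */
		for (size_t i = 0; i < chunk; i++)
			window[i] = sys_pages[done + i];

		/* One "copy job" per batch: window slots -> contiguous VRAM. */
		for (size_t i = 0; i < chunk; i++)
			memcpy(vram + (done + i) * SKETCH_PAGE_SIZE,
			       window[i], SKETCH_PAGE_SIZE);

		done += chunk;
		batches++;
	}
	return batches;
}
```

With a four-page window, a six-page range takes two batches. The
copy_to_ram direction would be the same loop with source and destination
swapped.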

Limitations:

- Single GPU only; multi-GPU migration is not addressed
- No VRAM-to-VRAM (peer GPU) migration

Open issue:

- Unnecessary TTM system memory allocation during eviction: when TTM
  evicts an SVM BO, it allocates a destination system memory resource
  (TTM_PL_SYSTEM) before calling bo_move, then frees it afterwards.
  This allocation is unnecessary because the actual data migration is
  done via drm_pagemap_evict_to_ram() → migrate_device_*, which
  migrates device-private pages directly to regular system pages,
  bypassing the TTM-allocated resource entirely. The current TTM
  framework does not support num_placement=0 to skip this redundant
  allocation; this needs further discussion.

Dependencies:

This series applies on top of:

  [1] amdgpu drm_gpusvm SVM core (xnack-on):
      https://lore.kernel.org/amd-gfx/[email protected]/
  [2] amdgpu drm_gpusvm SVM xnack-off restore:
      https://lore.kernel.org/amd-gfx/[email protected]/

Changes since v4:

- Unified XNACK-on and XNACK-off migration into a single series
  (v4 was XNACK-on only)
- Added patch 6/8: refactor SVM attr devmem_possible and prefer_vram
  API for cleaner integration with both paths
- Added patch 8/8: integrate VRAM migration into XNACK-off SVM restore
  and attr-change boundary realign paths
- Split the integration patch into two: fault/prefetch path (7/8) and
  restore/realign path (8/8) for better reviewability
- Rebased on latest SVM core with xnack-off restore support

Changes since v3:

- Rebased on drm_gpusvm-based amdgpu SVM core [1], switching from
  XNACK-off ioctl-driven to XNACK-on GPU fault-driven migration
- Introduced amdgpu_bo_svm subtype with drm_pagemap_devmem embedding
  and two-layer reference counting (GEM refcount + TTM kref)
- Added synchronous TTM eviction via drm_pagemap_evict_to_ram() in
  amdgpu_bo_move(), following the Xe driver pattern
- Added amdgpu_bo_is_amdgpu_bo() check for SVM BOs in TTM path
- Cleaned up container_of macros to follow amdgpu conventions
  (to_amdgpu_bo_svm as #define, devmem_to_amdgpu_bo_svm as inline)
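
For readers unfamiliar with the container_of convention mentioned in the
last item, a minimal userspace sketch follows. The struct layouts and field
names are simplified stand-ins assumed for illustration; only the two
helper names come from this cover letter, and the real definitions live in
the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() (no type checking). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins; the real structs carry many more fields. */
struct amdgpu_bo { int id; };
struct drm_pagemap_devmem { int placeholder; };

/* Hypothetical layout: an SVM BO embeds both the base BO and the devmem. */
struct amdgpu_bo_svm {
	struct amdgpu_bo bo;
	struct drm_pagemap_devmem devmem;
};

/* BO -> SVM BO: a plain #define, as the changelog describes. */
#define to_amdgpu_bo_svm(abo) \
	container_of((abo), struct amdgpu_bo_svm, bo)

/* devmem -> SVM BO: an inline function, so the argument type is checked. */
static inline struct amdgpu_bo_svm *
devmem_to_amdgpu_bo_svm(struct drm_pagemap_devmem *devmem)
{
	assert(devmem != NULL);
	return container_of(devmem, struct amdgpu_bo_svm, devmem);
}
```

Both helpers recover the same enclosing object, one starting from the
embedded BO and the other from the embedded drm_pagemap_devmem.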

Changes since v2:

- Moved amdgpu_pagemap entirely to amdgpu side, eliminating all KFD
  modifications
- Split commits for better reviewability: separated infrastructure
  from SDMA callbacks, decision layer from integration
- Merged ZONE_DEVICE registration hook into the integration patch

Changes since v1:

- Dropped the eviction fence patch (was 4/6) after Christian König
  pointed out it violates the dma_fence contract
- Refactored migration integration: extracted migration logic into
  new files amdgpu_svm_range_migrate.{c,h}
- Introduced enum amdgpu_svm_migrate_mode (PREFERRED, TO_VRAM,
  TO_SYSMEM, NONE) to make migration intent explicit, replacing
  the _ex functions used in v1

Previous versions:

  v1 (XNACK-off):
    https://lore.kernel.org/amd-gfx/[email protected]/
  v2 (XNACK-off):
    https://lore.kernel.org/amd-gfx/[email protected]/
  v3 (XNACK-off):
    https://lore.kernel.org/amd-gfx/[email protected]/
  v4 (XNACK-on):
    https://lore.kernel.org/amd-gfx/[email protected]/

Test results:

Tested on gfx943 (MI300X) and gfx906 (MI60) with both XNACK on and off:

- KFD test: 95%+ passed (both modes).
- ROCR test: 98%+ passed (both modes).

Patch overview:

  1/8  Core VRAM migration infrastructure (ZONE_DEVICE registration,
       amdgpu_pagemap, amdgpu_bo_svm subtype, drm_pagemap_ops)
  2/8  SDMA migration callbacks (copy_to_devmem, copy_to_ram,
       populate_devmem_pfn via GART aperture window)
  3/8  Synchronous TTM eviction for SVM BOs (amdgpu_svm_bo_evict
       in bo_move path, amdgpu_bo_is_amdgpu_bo check)
  4/8  Hook up ZONE_DEVICE registration in device init and GPU reset
  5/8  SVM range migration helpers (range-level migrate_to_vram /
       migrate_to_sysmem decision layer)
  6/8  Refactor SVM attr devmem_possible and prefer_vram API for
       unified xnack-on/off usage
  7/8  Wire up VRAM migration into SVM fault and prefetch paths
       (XNACK-on)
  8/8  Wire up VRAM migration into SVM restore and attr-change
       realign paths (XNACK-off)

Junhua Shen (8):
  drm/amdgpu: add VRAM migration infrastructure for drm_pagemap
  drm/amdgpu: implement drm_pagemap SDMA migration callbacks
  drm/amdgpu: implement synchronous TTM eviction for SVM BOs
  drm/amdgpu: hook up ZONE_DEVICE registration in device init and reset
  drm/amdgpu: add SVM range migration helpers for drm_pagemap
  drm/amdgpu: refactor SVM attr devmem_possible and prefer_vram API
  drm/amdgpu: integrate VRAM migration into SVM fault and prefetch paths
  drm/amdgpu: integrate VRAM migration into SVM restore and realign paths

drivers/gpu/drm/amd/amdgpu/Makefile | 8 +-
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 8 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +
drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.c | 831 +++++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.h | 102 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 4 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 2 +
drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 12 +
drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 4 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 18 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 5 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_fault.c | 20 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 32 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 1 +
.../gpu/drm/amd/amdgpu/amdgpu_svm_range_migrate.c | 115 +++
.../gpu/drm/amd/amdgpu/amdgpu_svm_range_migrate.h | 35 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 20 +
drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.c | 85 ++-
18 files changed, 1244 insertions(+), 60 deletions(-)
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.c
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.h
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range_migrate.c
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range_migrate.h
--
2.34.1