This series adds VRAM migration support to amdgpu's SVM (Shared Virtual Memory) implementation, using the drm_pagemap framework for ZONE_DEVICE page management and SDMA for data migration.

This is the XNACK-on (GPU fault-driven) version of the migration series,
built on top of the drm_gpusvm-based amdgpu SVM core [1]. Previous
v1/v2/v3 were XNACK-off (ioctl-driven) based on an earlier SVM core;
this v4 is a rewrite targeting the XNACK-on path.

The implementation follows the Xe driver's approach for TTM eviction,
using synchronous bo_move to migrate device-private pages back to system
RAM when TTM needs to evict SVM BOs.

Key design points:
- GPU VRAM registered as ZONE_DEVICE via devm_memremap_pages(), wrapped
  in struct amdgpu_pagemap with drm_pagemap state
- SDMA-based data transfer through GART aperture window for both
  copy_to_devmem and copy_to_ram callbacks
- amdgpu_bo_svm: lightweight BO subtype with drm_pagemap_devmem for
  ZONE_DEVICE page ownership tracking
- Synchronous TTM eviction via drm_pagemap_evict_to_ram() in bo_move,
  following the Xe pattern (no eviction fences needed)
- Migration policy driven by SVM range attributes (preferred location,
  prefetch hints) and GPU fault path

Limitations:
- Single GPU only; multi-GPU migration is not addressed
- No VRAM-to-VRAM (peer GPU) migration

Open issue:
- Unnecessary TTM system memory allocation during eviction: when TTM
  evicts an SVM BO, it allocates a destination system memory resource
  (TTM_PL_SYSTEM) before calling bo_move, then frees it afterwards.
  This allocation is unnecessary because the actual data migration is
  done via drm_pagemap_evict_to_ram() → migrate_device_*, which migrates
  device-private pages directly to regular system pages, bypassing the
  TTM-allocated resource entirely. The current TTM framework does not
  support num_placement=0 to skip this redundant allocation; this needs
  further discussion.

Dependencies:
This series applies on top of the amdgpu drm_gpusvm SVM core [1].

[1] https://lore.kernel.org/amd-gfx/[email protected]/

Changes since v3:
- Rebased on drm_gpusvm-based amdgpu SVM core [1], switching from
  XNACK-off ioctl-driven to XNACK-on GPU fault-driven migration
- Introduced amdgpu_bo_svm subtype with drm_pagemap_devmem embedding and
  two-layer reference counting (GEM refcount + TTM kref)
- Added synchronous TTM eviction via drm_pagemap_evict_to_ram() in
  amdgpu_bo_move(), following the Xe driver pattern
- Added amdgpu_bo_is_amdgpu_bo() check for SVM BOs in TTM path
- Cleaned up container_of macros to follow amdgpu conventions
  (to_amdgpu_bo_svm as #define, devmem_to_amdgpu_bo_svm as inline)

Changes since v2:
- Moved amdgpu_pagemap entirely to amdgpu side, eliminating all KFD
  modifications
- Split commits for better reviewability: separated infrastructure from
  SDMA callbacks, decision layer from integration
- Merged ZONE_DEVICE registration hook into the integration patch

Changes since v1:
- Dropped the eviction fence patch (was 4/6) after Christian König
  pointed out it violates the dma_fence contract
- Refactored migration integration: extracted migration logic into new
  files amdgpu_svm_range_migrate.{c,h}
- Introduced enum amdgpu_svm_migrate_mode (PREFERRED, TO_VRAM,
  TO_SYSMEM, NONE) to make migration intent explicit, replacing the _ex
  functions used in v1

Previous versions:
v1 (XNACK-off): https://lore.kernel.org/amd-gfx/[email protected]/
v2 (XNACK-off): https://lore.kernel.org/amd-gfx/[email protected]/
v3 (XNACK-off): https://lore.kernel.org/amd-gfx/[email protected]/

Test results:
Tested on gfx943 (MI300X) and gfx906 (MI60) with XNACK on:
- KFD test: 95%+ passed.
- ROCR test: all passed.
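
As a reading aid for the migration-mode enum named in the v1 changelog,
here is a minimal userspace sketch of how a mode could be resolved to a
target placement. Only the enum name and its four mode names come from
this series; the AMDGPU_SVM_MIGRATE_ prefix, enum sketch_location,
struct sketch_range and sketch_resolve_target() are hypothetical and are
not taken from amdgpu_svm_range_migrate.c.

```c
#include <stdbool.h>

/* Mode names follow the cover letter; the enumerator prefix is assumed. */
enum amdgpu_svm_migrate_mode {
	AMDGPU_SVM_MIGRATE_PREFERRED,	/* follow the range's preferred location */
	AMDGPU_SVM_MIGRATE_TO_VRAM,	/* force migration to VRAM */
	AMDGPU_SVM_MIGRATE_TO_SYSMEM,	/* force migration to system RAM */
	AMDGPU_SVM_MIGRATE_NONE,	/* leave pages where they are */
};

/* Hypothetical placement values, for illustration only. */
enum sketch_location {
	SKETCH_LOC_SYSMEM,
	SKETCH_LOC_VRAM,
};

/* Hypothetical, reduced view of an SVM range's attributes. */
struct sketch_range {
	enum sketch_location current;	/* where the pages live now */
	enum sketch_location preferred;	/* preferred-location attribute */
};

/*
 * Resolve a migration mode to a concrete target placement.
 * Returns true and sets *target when a migration is needed; returns
 * false when the mode is NONE or the pages are already at the target.
 */
static bool sketch_resolve_target(const struct sketch_range *range,
				  enum amdgpu_svm_migrate_mode mode,
				  enum sketch_location *target)
{
	enum sketch_location want;

	switch (mode) {
	case AMDGPU_SVM_MIGRATE_PREFERRED:
		want = range->preferred;
		break;
	case AMDGPU_SVM_MIGRATE_TO_VRAM:
		want = SKETCH_LOC_VRAM;
		break;
	case AMDGPU_SVM_MIGRATE_TO_SYSMEM:
		want = SKETCH_LOC_SYSMEM;
		break;
	case AMDGPU_SVM_MIGRATE_NONE:
	default:
		return false;
	}

	if (want == range->current)
		return false;

	*target = want;
	return true;
}
```

Making the intent an explicit enum argument, rather than a family of
_ex function variants, keeps the decision in one switch that a caller
on the fault or prefetch path can share.
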

Patch overview:
1/6 Core VRAM migration infrastructure (ZONE_DEVICE registration,
    amdgpu_pagemap, amdgpu_bo_svm subtype, drm_pagemap_ops)
2/6 SDMA migration callbacks (copy_to_devmem, copy_to_ram,
    populate_devmem_pfn via GART aperture window)
3/6 Synchronous TTM eviction for SVM BOs (amdgpu_svm_bo_evict in
    bo_move path, amdgpu_bo_is_amdgpu_bo check)
4/6 SVM range migration helpers (range-level migrate_to_vram /
    migrate_to_sysmem decision layer)
5/6 Hook up ZONE_DEVICE registration in device init and GPU reset
6/6 Wire up VRAM migration into SVM range map and GPU fault paths

Junhua Shen (6):
  drm/amdgpu: add VRAM migration infrastructure for drm_pagemap
  drm/amdgpu: implement drm_pagemap SDMA migration callbacks
  drm/amdgpu: implement synchronous TTM eviction for SVM BOs
  drm/amdgpu: add SVM range migration helpers for drm_pagemap
  drm/amdgpu: hook up ZONE_DEVICE registration in device init and reset
  drm/amdgpu: integrate VRAM migration into SVM range map and fault paths

 drivers/gpu/drm/amd/amdgpu/Makefile           |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   8 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.c   | 831 ++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.h   | 110 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c    |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c  |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_fault.c |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c |  21 +-
 .../drm/amd/amdgpu/amdgpu_svm_range_migrate.c | 122 +++
 .../drm/amd/amdgpu/amdgpu_svm_range_migrate.h |  47 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  20 +
 13 files changed, 1181 insertions(+), 9 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_migrate.h
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range_migrate.h

--
2.34.1
