From: Honglei Huang <[email protected]>
V2 of the xnack-off mode SVM patch series.
This revision introduces a centralized non-retryable error classifier
and reworks the attr change path with a proper NEED_REMAP trigger.
This patch series implements SVM support with the following design:
- The notifier invalidate callback moves ranges onto a
spinlock-protected invalidated list, like __vma_userptr_invalidate()
in xe_userptr.
- A restore worker iterates the invalidated list and calls
drm_gpusvm_get_pages() to re-acquire pages and re-establish GPU
mappings, the same get_pages + rebind flow used by
xe_vm_userptr_pin(). On transient failure, ranges are re-enqueued,
following xe_userptr's retry-on-EAGAIN pattern.
- Lifecycle follows the same init/fini/flush structure as
xe_userptr_setup/remove/destroy, with flush ensuring all pending
work completes before teardown.
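
For readers who do not have xe_userptr in mind, the invalidate/restore
flow described above can be sketched in plain userspace C. This is only
an illustration: every sketch_* name is hypothetical, locking is reduced
to comments, and sketch_get_pages_and_map() is a stand-in that simulates
transient failures rather than calling drm_gpusvm_get_pages().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an SVM range; not the struct used in the series. */
struct sketch_range {
	struct sketch_range *next;
	bool queued;      /* already on the invalidated list */
	int fail_budget;  /* transient failures to simulate before success */
	bool mapped;      /* GPU mapping considered valid */
};

struct sketch_svm {
	/* In the kernel this list would be protected by a spinlock. */
	struct sketch_range *invalidated;
};

/* Notifier side: mark the range unmapped and queue it once. */
static void sketch_invalidate(struct sketch_svm *svm, struct sketch_range *r)
{
	r->mapped = false;
	if (r->queued)
		return;
	r->queued = true;
	r->next = svm->invalidated;
	svm->invalidated = r;
}

/* Assumed stand-in for the get_pages + rebind step. */
static int sketch_get_pages_and_map(struct sketch_range *r)
{
	if (r->fail_budget > 0) {
		r->fail_budget--;
		return -EAGAIN;
	}
	r->mapped = true;
	return 0;
}

/* One pass of the restore worker; returns how many ranges were re-enqueued. */
static int sketch_restore_pass(struct sketch_svm *svm)
{
	/* Detach the whole list first (kernel: splice under the lock). */
	struct sketch_range *batch = svm->invalidated;
	int requeued = 0;

	svm->invalidated = NULL;
	while (batch) {
		struct sketch_range *r = batch;

		batch = r->next;
		r->next = NULL;
		r->queued = false;
		if (sketch_get_pages_and_map(r) == -EAGAIN) {
			/* Transient failure: put the range back for a later pass. */
			sketch_invalidate(svm, r);
			requeued++;
		}
	}
	return requeued;
}
```

Detaching the list before processing lets the notifier keep queueing
ranges while a pass is running, which is the reason a re-enqueue on
-EAGAIN is safe in this sketch.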
V2:
- Add amdgpu_svm_nonretryable() helper; restore worker and GC worker
now share a single classifier for permanent errors (-ENOENT, -EFAULT,
-EPERM, -EINVAL, -EHWPOISON) instead of open-coded checks.
- Restore worker drops ranges that hit non-retryable errors instead of
retrying them indefinitely.
- Add AMDGPU_SVM_ATTR_TRIGGER_NEED_REMAP macro as a semantic alias for
NEED_INVALIDATE, separating xnack-off (force map) from xnack-on
(invalidate + fault) intent.
- Rework amdgpu_svm_apply_attr_change() xnack-off path: force mapping
when the range is accessible, using the NEED_REMAP trigger.
- amdgpu_svm_range_put_if_dequeued(): schedule restore work when
pending restore ops remain after GC dequeue.
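
As an illustration of the shared classifier, a minimal userspace sketch
follows. The errno set is the one listed above; the function name,
signature, the sketch_action enum and the "everything else is retried"
policy are assumptions made for the example, not the code in the
patches.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* fallback: Linux asm-generic value, for non-Linux builds */
#endif

/* Hypothetical mirror of amdgpu_svm_nonretryable(): true for permanent errors. */
static bool sketch_svm_nonretryable(int err)
{
	switch (err) {
	case -ENOENT:
	case -EFAULT:
	case -EPERM:
	case -EINVAL:
	case -EHWPOISON:
		return true;
	default:
		return false;
	}
}

/* Hypothetical decision a worker could derive from the classifier. */
enum sketch_action { SKETCH_DONE, SKETCH_RETRY, SKETCH_DROP };

static enum sketch_action sketch_restore_decide(int err)
{
	if (!err)
		return SKETCH_DONE;
	/* Assumption: any error not classified as permanent is retried. */
	return sketch_svm_nonretryable(err) ? SKETCH_DROP : SKETCH_RETRY;
}
```

Keeping the errno list in one helper means the restore and GC paths
cannot drift apart on what counts as a permanent failure.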
Related work:
This series depends on the base amdgpu SVM series:
https://lore.kernel.org/amd-gfx/[email protected]/
Test results:
Tested on gfx943 (MI300X) and gfx1100 (W7900) with XNACK off:
- KFD test: 95%+ passed.
- ROCR test: all passed.
- HIP catch test: gfx943 (MI300X) 99% passed, gfx1100 (W7900) 99% passed.
Patch overview:
Patches 1-2: Define restore types/states and integrate into core headers.
Patch 3: Invalidate callback - dispatch ranges to restore or GC list.
Patch 4: Restore worker - get_pages + rebind loop with non-retryable
error classification and selective retry.
Patch 5: GC worker - remove unmapped ranges, rebuild partial intervals,
skip non-retryable errors via shared helper.
Patch 6: Compute queue quiesce/resume helpers.
Patch 7: Attr change boundary realign helper.
Patch 8: Wire restore into SVM lifecycle and attr set path with
NEED_REMAP trigger and eager remapping for xnack-off.
Honglei Huang (8):
drm/amdgpu: add xnack-off restore types header
drm/amdgpu: integrate xnack-off restore types into core headers
drm/amdgpu: implement xnack-off restore core and invalidate callback
drm/amdgpu: implement xnack-off restore worker
drm/amdgpu: implement xnack-off GC work function
drm/amdgpu: add xnack-off compute queue quiesce and resume helpers
drm/amdgpu: add xnack-off attr change boundary realign helper
drm/amdgpu: wire xnack-off restore into lifecycle and attr set
drivers/gpu/drm/amd/amdgpu/Makefile | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 56 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 3 +
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 3 +
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 8 +
drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 2 +
drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.c | 907 ++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.h | 68 ++
8 files changed, 1045 insertions(+), 8 deletions(-)
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.c
create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.h
--
2.34.1