From: Honglei Huang <[email protected]>

V2 of the xnack-off mode SVM patch series.
This revision introduces a centralized non-retryable error classifier
and reworks the attr change path with a proper NEED_REMAP trigger.

This patch series implements SVM support with the following design:
  - The notifier invalidate callback moves ranges onto a
    spinlock-protected invalidated list, as __vma_userptr_invalidate()
    does in xe_userptr.
  - A restore worker iterates the invalidated list and calls
    drm_gpusvm_get_pages() to re-acquire pages and GPU mappings,
    the same get_pages + rebind flow used by xe_vm_userptr_pin().
    On transient failure, ranges are re-enqueued, following
    xe_userptr's retry-on-EAGAIN pattern.
  - Lifecycle follows the same init/fini/flush structure as
    xe_userptr_setup/remove/destroy, with flush ensuring all pending
    work completes before teardown.
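
The invalidate/restore flow above can be modelled in userspace roughly as
follows. This is only an illustrative sketch, not code from the series: every
sketch_* name is hypothetical, a C11 atomic_flag stands in for the kernel
spinlock, the get_pages + rebind step is a stub, and only -EAGAIN is treated
as transient for simplicity.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical range node; the driver's real range struct is not shown here. */
struct sketch_range {
	struct sketch_range *next;
	int attempts_until_ok;	/* stub knob: number of -EAGAIN results left */
	int mapped;
};

struct sketch_svm {
	atomic_flag lock;			/* stands in for the spinlock */
	struct sketch_range *invalidated;	/* the invalidated list */
};

static void sketch_lock(struct sketch_svm *svm)
{
	while (atomic_flag_test_and_set_explicit(&svm->lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void sketch_unlock(struct sketch_svm *svm)
{
	atomic_flag_clear_explicit(&svm->lock, memory_order_release);
}

/* Notifier side: mark the range unmapped and queue it for restore. */
static void sketch_invalidate(struct sketch_svm *svm, struct sketch_range *r)
{
	assert(r != NULL);
	sketch_lock(svm);
	r->mapped = 0;
	r->next = svm->invalidated;
	svm->invalidated = r;
	sketch_unlock(svm);
}

/* Stub for the get_pages + rebind step; fails transiently a few times. */
static int sketch_get_pages_and_rebind(struct sketch_range *r)
{
	if (r->attempts_until_ok > 0) {
		r->attempts_until_ok--;
		return -EAGAIN;
	}
	r->mapped = 1;
	return 0;
}

/* One restore pass; returns how many ranges were re-enqueued. */
static int sketch_restore_pass(struct sketch_svm *svm)
{
	struct sketch_range *batch, *r;
	int requeued = 0;

	/* Detach the whole list under the lock, then work on it unlocked. */
	sketch_lock(svm);
	batch = svm->invalidated;
	svm->invalidated = NULL;
	sketch_unlock(svm);

	while ((r = batch) != NULL) {
		batch = r->next;
		r->next = NULL;
		if (sketch_get_pages_and_rebind(r) == -EAGAIN) {
			sketch_invalidate(svm, r);	/* transient: retry later */
			requeued++;
		}
	}
	return requeued;
}
```

A caller would keep scheduling sketch_restore_pass() while it returns a
non-zero count. Detaching the list before processing keeps the lock hold time
short, so the invalidate side is never blocked behind page acquisition.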

V2:
  - Add amdgpu_svm_nonretryable() helper; restore worker and GC worker
    now share a single classifier for permanent errors (-ENOENT, -EFAULT,
    -EPERM, -EINVAL, -EHWPOISON) instead of open-coded checks.
  - Restore worker drops non-retryable errors instead of retrying
    indefinitely.
  - Add AMDGPU_SVM_ATTR_TRIGGER_NEED_REMAP macro as semantic alias for
    NEED_INVALIDATE, separating xnack-off (force map) from xnack-on
    (invalidate + fault) intent.
  - Rework amdgpu_svm_apply_attr_change() xnack-off path: force mapping
    when range is accessible, using NEED_REMAP trigger.
  - amdgpu_svm_range_put_if_dequeued(): schedule restore work when
    pending restore ops remain after GC dequeue.
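
For illustration, a shared classifier over the permanent errors listed in the
first V2 item might look like the sketch below. The name
sketch_svm_nonretryable() and its int-in, bool-out signature are assumptions;
the actual amdgpu_svm_nonretryable() signature is not shown in this cover
letter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* EHWPOISON is Linux-specific; use the generic Linux value if libc lacks it. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/*
 * Hypothetical stand-in for amdgpu_svm_nonretryable().
 * Returns true when err is one of the permanent errors named in the
 * changelog, false for success and for anything that may be retried.
 */
static bool sketch_svm_nonretryable(int err)
{
	assert(err <= 0);	/* kernel-style codes: 0 or a negative errno */

	switch (err) {
	case -ENOENT:
	case -EFAULT:
	case -EPERM:
	case -EINVAL:
	case -EHWPOISON:
		return true;
	default:
		return false;
	}
}
```

Keeping the list in one switch means the restore and GC workers cannot drift
apart on which errors count as permanent.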

Related work:
This series depends on the base amdgpu SVM series:
  
https://lore.kernel.org/amd-gfx/[email protected]/

Test results:
  Tested on gfx943 (MI300X) and gfx1100 (W7900) with XNACK off:
  - KFD test: 95%+ passed.
  - ROCR test: all passed.
  - HIP catch test: gfx943 (MI300X): 99% passed.
                    gfx1100 (W7900): 99% passed.

Patch overview:
  Patch 1-2: Define restore types/states and integrate into core headers.
  Patch 3:   Invalidate callback - dispatch ranges to restore or GC list.
  Patch 4:   Restore worker - get_pages + rebind loop with non-retryable
             error classification and selective retry.
  Patch 5:   GC worker - remove unmapped ranges, rebuild partial intervals,
             skip non-retryable errors via shared helper.
  Patch 6:   Compute queue quiesce/resume helpers.
  Patch 7:   Attr change boundary realign helper.
  Patch 8:   Wire restore into SVM lifecycle and attr set path with
             NEED_REMAP trigger and eager remapping for xnack-off.

Honglei Huang (8):
  drm/amdgpu: add xnack-off restore types header
  drm/amdgpu: integrate xnack-off restore types into core headers
  drm/amdgpu: implement xnack-off restore core and invalidate callback
  drm/amdgpu: implement xnack-off restore worker
  drm/amdgpu: implement xnack-off GC work function
  drm/amdgpu: add xnack-off compute queue quiesce and resume helpers
  drm/amdgpu: add xnack-off attr change boundary realign helper
  drm/amdgpu: wire xnack-off restore into lifecycle and attr set

 drivers/gpu/drm/amd/amdgpu/Makefile           |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c       |  56 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h       |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h  |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c |   8 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.c   | 907 ++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.h   |  68 ++
 8 files changed, 1045 insertions(+), 8 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_userptr.h

-- 
2.34.1
