Hi,

I would like to make sure this AMDKFD SVM regression is tracked by the
Linux regression process.

GitLab report:

  https://gitlab.freedesktop.org/drm/amd/-/work_items/4914

The regression was originally reported on 2026-01-27. It was bisected to the
same functional change that Alex Deucher's revert patch later targeted:

  448ee45353ef9fb1a34f5f26eb3f48923c6f0898
  drm/amdkfd: Use huge page size to check split svm range alignment

In the affected kernel line I tested, the same change has the commit ID:

  bf2084a7b1d75d093b6a79df4c10142d49fbaa0e

Alex's revert patch:

  https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html

A small C/HSA reproducer is now available in the GitLab report. It does not
require PyTorch, ComfyUI, Docker, model files, or the original workload. It
uses ROCr/HSA, an anonymous THP-advised host mapping, explicit KFD SVM
SET_ATTR ioctls, and an HSA SDMA D2H copy.

Single reproducer command, same binary on both kernels:

  ./kfd_svm_split_hsa_copy --upstream-ab

Same-machine A/B result on an RX 7600 XT:

  448ee453/bf2084a7 active:
    1/1 run fails with an SDMA0 permission fault
    GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51

  448ee453/bf2084a7 locally reverted:
    10/10 runs complete
    no ROCr memory access fault
    no new GCVM/SDMA0 permission fault in dmesg

The bad fault page is inside the split tail and inside the SDMA copy range:

  critical tail: [0x722429d61..0x722429dff]
  copy pages:    [0x722429b30..0x722429d70]
  fault page:    0x722429d65
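
The containment claim can be checked with plain arithmetic. A minimal sketch
in C, assuming both ranges are inclusive page-number ranges as printed above.
The helper names are hypothetical and are not part of the reproducer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if page lies in the inclusive range [first, last]. */
static bool page_in_range(uint64_t page, uint64_t first, uint64_t last)
{
	assert(first <= last);
	return page >= first && page <= last;
}

/* Page numbers copied from the A/B run above. */
static bool fault_in_tail_and_copy(void)
{
	const uint64_t fault = 0x722429d65ULL;

	return page_in_range(fault, 0x722429d61ULL, 0x722429dffULL) &&  /* critical tail */
	       page_in_range(fault, 0x722429b30ULL, 0x722429d70ULL);    /* copy pages */
}
```

fault_in_tail_and_copy() returns true for these values: the fault page is four
pages past the start of the tail and below the end of the copy range.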

A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:

  split_tail ... current_remap=0 old_remap=1 missed=1
  MISSED_REMAP_CANDIDATE split=tail
  no amdgpu_vm_update_ptes event covering the fault page between the marker
  and the fault-side GET_ATTR

The suspected code issue is that the split-tail/head remap predicate introduced
by 448ee453/bf2084a7 can miss tails inside the final 512-page block. Since
prange->last is inclusive, ALIGN_DOWN(prange->last, 512) is the start of the
final block, not an exclusive upper bound.
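
To make the inclusive-bound point concrete, here is a small C sketch of the
arithmetic. It assumes the pre-split range ended at page 0x722429dff (the end
of the critical tail above) and uses a simplified comparison of the form
tail_start < bound as a stand-in for the upper-bound half of the check; the
actual predicate in the driver may differ in detail. All helper names are
hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HUGE_PAGE_PAGES 512ULL /* 2 MiB huge page / 4 KiB base pages */

/* Same result as the kernel's ALIGN_DOWN() for a power-of-two alignment. */
static uint64_t align_down(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return x & ~(a - 1);
}

/* Bound as described above: the start of the final block, because last is inclusive. */
static uint64_t bound_from_inclusive_last(uint64_t last)
{
	return align_down(last, HUGE_PAGE_PAGES);
}

/* Illustrative alternative only: align the exclusive end (last + 1) instead. */
static uint64_t bound_from_exclusive_end(uint64_t last)
{
	return align_down(last + 1, HUGE_PAGE_PAGES);
}

/* Simplified stand-in for the upper-bound half of the remap check. */
static bool tail_below_bound(uint64_t tail_start, uint64_t bound)
{
	return tail_start < bound;
}
```

With last = 0x722429dff, bound_from_inclusive_last() gives 0x722429c00, so a
tail starting at 0x722429d61 is not below it. bound_from_exclusive_end() gives
0x722429e00, which the same tail start is below. This only illustrates the
arithmetic and is not a proposed patch.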

I also sent a short follow-up to amd-gfx with the reproducer/A-B summary and
asked what original failure or workload 448ee453/bf2084a7 was intended to fix:

  https://lists.freedesktop.org/archives/amd-gfx/2026-June/145800.html

I can resend the reproducer source and summaries directly on-list if preferred.

#regzbot introduced: 448ee45353ef9fb1a34f5f26eb3f48923c6f0898
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914

Thanks,
Gerhard Schwanzer
