Replying to Alex's revert patch:
https://lists.freedesktop.org/archives/amd-gfx/2026-February/138824.html

Hi,

The requested small reproducer for drm/amd#4914 is now available:

  https://gitlab.freedesktop.org/drm/amd/-/work_items/4914

It is a standalone C/HSA program. It does not require PyTorch, ComfyUI,
Docker, model files, or the original workload.

Same binary, same command on both kernels:

  ./kfd_svm_split_hsa_copy --upstream-ab

A/B result on the same RX 7600 XT system:

  bf2084a7 active:
    1/1 run faults with SDMA0 permission fault
    GCVM_L2_PROTECTION_FAULT_STATUS=0x00841A51

  bf2084a7 reverted:
    10/10 runs complete
    no ROCr memory access fault
    no new GCVM/SDMA0 permission fault in dmesg

The faulting page lies inside both the split tail and the SDMA copy range:

  critical tail: [0x722429d61..0x722429dff]
  copy pages:    [0x722429b30..0x722429d70]
  fault page:    0x722429d65

A full ftrace/PTE run with the same C reproducer/SVM sequence also shows:

  split_tail ... current_remap=0 old_remap=1 missed=1
  MISSED_REMAP_CANDIDATE split=tail
  no amdgpu_vm_update_ptes covering the fault page between the marker and
  the fault-side GET_ATTR

One open question remains on my side:

  What original failure or workload was 448ee453/bf2084a7 intended to fix?

If there is a test case for the original problem, I can check whether a
replacement fix covers both that case and this regression.

Could this revert, or an equivalent fix, be reconsidered?

I can resend the reproducer and summaries directly on-list if preferred.

Thanks,
Gerhard Schwanzer
