[sorry, I hit send too early]

On 2023-10-23 11:15, Christian König wrote:
Am 23.10.23 um 15:06 schrieb Daniel Tang:
That commit causes the screen to freeze a few moments after running
clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer
including ssh also freezes. On v6.5-rc1, it only results in a NULL pointer
deference message in dmesg and the process to become a zombie whose
unkillableness prevents shutdown without REISUB. Although llama.cpp and
hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by
this revert, pytorch-rocm is now working with stability and without
whole-computer freezes caused by any accidental running of clinfo.

This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639.

That result doesn't make much sense. Felix please correct me, but AFAIK the ATS stuff was completely removed by now.

Are you sure that this is pure v6.6-rc7 and not some other patches applied? If yes than we must have missed something.

This revert doesn't really affect systems with ATS. It moves the sanity check back out of the ATS-specific code.

The Null pointer dereference in the bug report comes from the CPU page table update code:

[10089.267556] BUG: kernel NULL pointer dereference, address: 0000000000000000
[10089.267563] #PF: supervisor write access in kernel mode
[10089.267566] #PF: error_code(0x0002) - not-present page
[10089.267569] PGD 0 P4D 0
[10089.267574] Oops: 0002 [#1] PREEMPT SMP NOPTI
[10089.267578] CPU: 23 PID: 18191 Comm: clinfo Tainted: G           OE      
6.5.0-9-generic #9-Ubuntu
[10089.267582] Hardware name: Micro-Star International Co., Ltd. MS-7C37/X570-A 
PRO (MS-7C37), BIOS H.I0 08/10/2022
[10089.267585] RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu]
[10089.267820] Code: 90 90 90 90 90 90 90 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 
00 55 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 48 89 e5 <48> 89 0e 31 c0 
5d 31 d2 31 c9 31 f6 45 31 c0 e9 89 7e 27 fb 66 0f
[10089.267823] RSP: 0018:ffffb49805eeb8b0 EFLAGS: 00010246
[10089.267827] RAX: 0000000000000000 RBX: 0000000000200000 RCX: 0040000000000480
[10089.267830] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9890d4380000
[10089.267832] RBP: ffffb49805eeb8b0 R08: 0040000000000480 R09: 0000000000200000
[10089.267835] R10: 0000000800100200 R11: 0000000800100200 R12: ffffb49805eeba98
[10089.267837] R13: 0000000000000001 R14: 0000000000200000 R15: 0000000000000001
[10089.267840] FS:  00007f8ca9f09740(0000) GS:ffff9897befc0000(0000) 
knlGS:0000000000000000
[10089.267843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10089.267846] CR2: 0000000000000000 CR3: 00000002e0746000 CR4: 0000000000750ee0
[10089.267849] PKRU: 55555554
[10089.267851] Call Trace:
[10089.267853]  <TASK>
[10089.267858]  ? show_regs+0x6d/0x80
[10089.267865]  ? __die+0x24/0x80
[10089.267870]  ? page_fault_oops+0x99/0x1b0
[10089.267876]  ? do_user_addr_fault+0x316/0x6b0
[10089.267879]  ? srso_alias_return_thunk+0x5/0x7f
[10089.267884]  ? scsi_dispatch_cmd+0x91/0x240
[10089.267891]  ? exc_page_fault+0x83/0x1b0
[10089.267896]  ? asm_exc_page_fault+0x27/0x30
[10089.267904]  ? amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu]
[10089.268140]  amdgpu_vm_cpu_update+0xa9/0x130 [amdgpu]
...

This revert is just a roundabout way of disabling CPU page table updates for compute VMs. But I don't think it really addresses the root cause.

Regards,
  Felix



Regards,
Christian.


Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596
Signed-off-by: Daniel Tang <danielzgtg.opensou...@gmail.com>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++------
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 82f25996ff5e..602f311ab766 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
      if (r)
          return r;
  +    /* Sanity checks */
+    if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
+        r = -EINVAL;
+        goto unreserve_bo;
+    }
+
      /* Check if PD needs to be reinitialized and do it before
       * changing any other state, in case it fails.
       */
      if (pte_support_ats != vm->pte_support_ats) {
-        /* Sanity checks */
-        if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
-            r = -EINVAL;
-            goto unreserve_bo;
-        }
-
          vm->pte_support_ats = pte_support_ats;
          r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo),
                         false);
-- 2.40.1



Reply via email to