On 1/18/23 15:03, Étienne Mollier wrote:
2. The symbol tracking needs to be reviewed by somebody more experienced
than me. I think that anything in the rocrand::detail or
rocrand_host::detail namespace should be marked optional, as those symbols
are not intended for use by library users.
Ideally they should not be exposed (by the means that the build
flag -fvisibility=hidden provides, but I'm not sure of the
implementation details on the upstream side, to be honest). If
the symbols are not part of the public interface but are still
referenced, and we are sure they are unused by reverse
dependencies, they can probably be marked (optional). The library
soversion suggests the stable part of the ABI should not have had
a breakage, so I guess the (optional) marker is fine.
After further thought, I've added a patch to hide the kernels by marking
them as static and removed all optional symbols from the symbol
tracking. The ABI is a lot easier to verify with all that junk hidden.
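For anyone following along, here is a minimal sketch of the idea in plain
C++. It is not the actual patch: the demo namespace, the DEMO_EXPORT macro
and both function names are hypothetical, and an ordinary host function
stands in for a GPU kernel. The helper has internal linkage because it is
marked static, so it is never exported, while the public entry point is
exported explicitly.

```cpp
// Sketch only: every name here is hypothetical, not taken from rocRAND.
#if defined(__GNUC__)
// Keeps the symbol exported even when building with -fvisibility=hidden.
#define DEMO_EXPORT __attribute__((visibility("default")))
#else
#define DEMO_EXPORT
#endif

namespace demo {
namespace detail {

// 'static' gives this helper internal linkage, so it cannot appear in the
// shared library's dynamic symbol table. It stands in for a hidden kernel.
static int scale_and_offset(int x) { return 2 * x + 1; }

} // namespace detail

// Public entry point: the only symbol meant to be part of the ABI.
DEMO_EXPORT int public_entry(int x) { return detail::scale_and_offset(x); }

} // namespace demo
```

If this were built as a shared library, I would expect `nm -D --defined-only`
on the result to list public_entry but not scale_and_offset, which is what
makes the symbols file so much shorter to review.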
I wanted to take that opportunity to stabilize the test suite of
rocm-hipamd, but I'm currently failing on:
test 103
Start 103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst
103: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/ipc/hipMultiProcIpcMem "
" "--N" "4"
103: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
103: Environment variables:
103: HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
103: Test timeout computed to be: 1500
103: KFD does not support xnack mode query.
103: ROCr must assume xnack is disabled.
103: error: 'hipErrorInvalidDevicePointer'(17) from hipIpcGetMemHandle(&ipc_handle,
ipc_offset_dptr) at /<<PKGBUILDDIR>>/hip/tests/src/ipc/hipMultiProcIpcMem.cpp:55
103: error: API returned error code.
103: error: TEST FAILED
103:
103/414 Test #103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst
.......................................................................................Subprocess
aborted***Exception: 792.07 sec
A later test then crashes:
test 126
Start 126: directed_tests/printf/hipPrintfManyWaves.tst
126: Test command:
/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/printf/hipPrintfManyWaves "
"
126: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
126: Environment variables:
126: HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
126: Test timeout computed to be: 1500
126: KFD does not support xnack mode query.
126: ROCr must assume xnack is disabled.
126: Memory access fault by GPU node-1 (Agent handle: 0x562e8fbb1e00)
on address (nil)(may not be exact address). Reason: DRAM ECC failure.
126: Nearby memory map:
126: 0x7f5497800000, 0x78c000, System
126: 0x7f549ac00000, 0x100000, VRAM
126: 0x7f549af00000, 0x80000, System
126:
126: PtrInfo:
126: Address:
0x7f5497800000-0x7f5497f8c000/0x7f5497800000-0x7f5497f8c000
126: Size: 0x78c000
126: Type: 1
126: Owner: 0x562e8fbac4b0
126: CanAccess: 1
126: 0x562e8fbb1e00
126: In block: 0x7f5497800000, 0x78c000
126: PtrInfo:
126: Address:
0x7f549ac00000-0x7f549ad00000/0x7f549ac00000-0x7f549ad00000
126: Size: 0x100000
126: Type: 1
126: Owner: 0x562e8fbb1e00
126: CanAccess: 1
126: 0x562e8fbb1e00
126: In block: 0x7f549ac00000, 0x200000
126: PtrInfo:
126: Address:
0x7f549af00000-0x7f549af80000/0x7f549af00000-0x7f549af80000
126: Size: 0x80000
126: Type: 1
126: Owner: 0x562e8fbac4b0
126: CanAccess: 1
126: 0x562e8fbb1e00
126: In block: 0x7f549af00000, 0x80000
126: hipPrintfManyWaves: ./src/core/runtime/runtime.cpp:1276: static bool
rocr::core::Runtime::VMFaultHandler(hsa_signal_value_t, void*): Assertion `false &&
"GPU memory access fault."' failed.
126/414 Test #126: directed_tests/printf/hipPrintfManyWaves.tst
........................................................................................Subprocess
aborted***Exception: 0.64 sec
That's unfortunate. It can be quite difficult to tell why these things fail.
At about the same time as #126, I get a kernel NULL pointer
dereference:
amdgpu: sq_intr: error, se 2, data 0x25, sh 0, priv 0, wave_id 0,
simd_id 0, cu_id 0, err_type 4
amdgpu 0000:0b:00.0: amdgpu: RAS poison consumption, unmap queue flow
succeeded: client id 10
BUG: kernel NULL pointer dereference, address: 00000000000001b0
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 206 Comm: kworker/7:1H Not tainted 6.1.0-1-amd64 #1 Debian
6.1.4-1
Hardware name: Gigabyte Technology Co., Ltd. X570 UD/X570 UD, BIOS F3
09/04/2019
Workqueue: KFD IH interrupt_wq [amdgpu]
RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48
8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa
66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
FS: 0000000000000000(0000) GS:ffff8938debc0000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0
Call Trace:
<TASK>
smu_get_ecc_info+0x1f/0x30 [amdgpu]
amdgpu_dpm_get_ecc_info+0x39/0x60 [amdgpu]
amdgpu_umc_do_page_retirement.constprop.0+0x38/0x170 [amdgpu]
amdgpu_umc_poison_handler+0x64/0xb0 [amdgpu]
amdgpu_amdkfd_ras_poison_consumption_handler+0x48/0x70 [amdgpu]
interrupt_wq+0xcf/0x120 [amdgpu]
process_one_work+0x1c7/0x380
worker_thread+0x4d/0x380
? _raw_spin_lock_irqsave+0x23/0x50
? rescuer_thread+0x3a0/0x3a0
kthread+0xe9/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
Modules linked in: overlay cpufreq_userspace cpufreq_powersave
cpufreq_ondemand cpufreq_conservative binfmt_misc nls_ascii nls_cp437 vfat fat
intel_rapl_msr intel_rapl_common amdgpu edac_mce_amd kvm_amd
snd_hda_codec_realtek kvm snd_hda_codec_generic ledtrig_audio irqbypass
snd_hda_codec_hdmi ghash_clmulni_intel sha512_ssse3 gpu_sched snd_hda_intel
sha512_generic snd_intel_dspcfg drm_buddy snd_intel_sdw_acpi video
snd_hda_codec drm_display_helper snd_hda_core cec rc_core snd_hwdep aesni_intel
snd_pcm drm_ttm_helper crypto_simd ttm cryptd snd_timer drm_kms_helper
gigabyte_wmi rapl snd pcspkr ccp wmi_bmof i2c_algo_bit sp5100_tco watchdog
k10temp soundcore rng_core evdev button acpi_cpufreq sg parport_pc ppdev lp drm
parport fuse efi_pstore configfs efivarfs ip_tables x_tables autofs4 xfs btrfs
zstd_compress raid1 dm_raid raid456 async_raid6_recov async_memcpy async_pq
async_xor async_tx md_mod xor raid6_pq libcrc32c crc32c_generic dm_mod sd_mod
hid_generic usbhid hid ahci nvme
xhci_pci libahci xhci_hcd nvme_core libata r8169 t10_pi realtek
mdio_devres crc32_pclmul crc64_rocksoft crc32c_intel crc64 usbcore libphy
crc_t10dif scsi_mod i2c_piix4 crct10dif_generic scsi_common usb_common
crct10dif_pclmul crct10dif_common wmi
CR2: 00000000000001b0
---[ end trace 0000000000000000 ]---
RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48
8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa
66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
FS: 0000000000000000(0000) GS:ffff8938debc0000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0
If there's a null pointer dereference in AMDGPU, I would presume that's
a kernel bug. I don't think the HIP library has any special powers that
another user program wouldn't.
This is the sort of thing where we would really benefit from a CI for
the GPU code. We run the tests so infrequently that there are a lot of
changes in the components and their dependencies between runs. If we
reran the tests every time there was a relevant change, like a kernel
update, it would help to narrow down the source of these problems. Also,
if wishes were fishes, I'd never go hungry.
Thanks,
Cory Bloor