Dear all,
We're facing an issue that is hopefully not directly related to
Lustre itself (we're not using community Lustre), but maybe someone
here has seen something similar or knows someone who has.
On our GPU partition with A100-SXM4-80GB GPUs (VBIOS version:
92.00.36.00.02), we’re trying to read IOPS statistics (osc_stats) via
the files under /sys/kernel/debug/lustre/osc/ (we’re running 160
OSTs, Lustre version 2.14.0_ddn184). Our goal is to sample the data
at 5-second intervals, then aggregate and postprocess it into
readable metrics.
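For reference, here is a minimal Python sketch of the kind of sampling described above; it is not the actual collectd plugin. It makes two assumptions that are not stated in this mail: that each OSC directory exposes a file named `stats`, and that counter lines look like `<name> <count> samples [<unit>] ...`. The function names (`parse_stats`, `read_all`, `rates`, `sample_loop`) are made up for illustration.

```python
import glob
import time


def parse_stats(text):
    """Return {counter_name: sample_count} from one stats file's text.

    Assumed line layout: "<name> <count> samples [<unit>] ...".
    Lines that do not match (e.g. snapshot_time) are ignored.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "samples":
            try:
                counters[fields[0]] = int(fields[1])
            except ValueError:
                continue
    return counters


def read_all(pattern):
    """Sum the counters of every file matching pattern (one file per OSC)."""
    totals = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as fh:
                parsed = parse_stats(fh.read())
        except OSError:
            # A target may disappear between glob() and open(); skip it.
            continue
        for name, value in parsed.items():
            totals[name] = totals.get(name, 0) + value
    return totals


def rates(prev, curr, interval):
    """Operations per second between two snapshots.

    Counters that went backwards (e.g. after a reset) are dropped;
    counters that are new in curr are reported as 0.0.
    """
    out = {}
    for name, value in curr.items():
        delta = value - prev.get(name, value)
        if delta >= 0:
            out[name] = delta / interval
    return out


def sample_loop(pattern="/sys/kernel/debug/lustre/osc/*/stats", interval=5.0):
    """Print aggregated rates every `interval` seconds (file name assumed)."""
    prev, last = read_all(pattern), time.monotonic()
    while True:
        time.sleep(interval)
        curr, now = read_all(pattern), time.monotonic()
        print(rates(prev, curr, now - last))
        prev, last = curr, now
```

With 160 OSTs, each pass of such a loop opens and reads 160 debugfs files every 5 seconds. The rates are computed from the measured elapsed time rather than the nominal interval, so a slow pass does not distort them.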
We have a collectd daemon running, which had been stable for a long
time. After integrating the IOPS metric, however, we occasionally hit
a kernel panic (see crash dump excerpts below). The issue appears to
originate somewhere in the GPU firmware stack, but we're unsure why
this happens and how it's related to reading Lustre metrics.
The problem keeps recurring, but it strikes at random and is hard to
reproduce on demand. We’re hesitant to run the scripts frequently since a crash
could interrupt critical GPU workloads. That said, limited test runs
over several hours often work fine, especially after a fresh reboot.
The CPU-only nodes run the same scripts without issues all the time.
Could this be a sign that /sys/kernel/debug is being overwhelmed
somehow? Even if so, that shouldn’t normally cause a kernel panic.
We’d appreciate any insights, experiences, or pointers, even indirect
ones.
Thanks in advance!
Anna
2024-12-17 17:11:28 [2453606.802826] NVRM: Xid (PCI:0000:03:00): 120, pid='<unknown>', name=<unknown>, GSP task timeout @ pc:0x4bd36c4, task: 1
2024-12-17 17:11:28 [2453606.802835] NVRM: Reported by libos task:0 v2.0 [0] @ ts:1734451888
2024-12-17 17:11:28 [2453606.802837] NVRM: RISC-V CSR State:
2024-12-17 17:11:28 [2453606.802840] NVRM: mstatus:0x000000001e000000 mscratch:0x0000000000000000 mie:0x0000000000000880 mip:0x0000000000000000
2024-12-17 17:11:28 [2453606.802842] NVRM: mepc:0x0000000004bd36c4 mbadaddr:0x00000100badca700 mcause:0x8000000000000007
2024-12-17 17:11:28 [2453606.802844] NVRM: RISC-V GPR State:
[...]
2024-12-17 17:11:29 [2453606.803121] NVRM: Xid (PCI:0000:03:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[...]
2024-12-17 17:30:03 [2454721.362906] Kernel panic - not syncing: Fatal exception
2024-12-17 17:30:03 [2454721.611822] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2024-12-17 17:30:03 [2454721.770927] ---[ end Kernel panic - not syncing: Fatal exception ]---