Hi,
thanks for the suggestions. The GPUs are in frequent use and already see heavy, system-call-intensive CPU load. However, since the issue occurs across different machines, we don't suspect a structural hardware defect. It seems more likely to be a bug in the NVIDIA stack, possibly triggered by concurrent access to /sys/kernel/debug.

At times we even reduced the query interval to one second, but the issue still doesn't reproduce reliably, even over several hours or a few days.

It seems like no one else has encountered this so far.
We'll keep investigating — thanks again!

Best
Anna


On 5/7/25 18:15, Oleg Drokin wrote:
Hello!

"An uncorrectable ECC error detected" does sound like there's some
hardware problem, while it is strange you only get this on GPU nodes
(Extra power load leading to higher chances of memory corruption + more
frequent kernel memory scannong increasing the chance to hit such
curruption?) I'd expect you'd be seeing other crashes on such GPU nodes
.

Can you just generate some other CPU load (one that involves system
calls) on those nodes and see if crashes suddenly go up as well, just
in some other area?
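
Something along these lines might do (just a rough Python sketch; the
directory and worker count are arbitrary, anything that keeps the CPUs
busy inside system calls would work):

#!/usr/bin/env python3
# Rough sketch of a syscall-heavy CPU load generator: each worker walks a
# directory tree and repeatedly stat()s and open()s files, so most of the
# time is spent in the kernel. Directory and worker count are arbitrary.
import multiprocessing
import os

TARGET_DIR = "/usr"            # any directory tree with plenty of files
WORKERS = os.cpu_count() or 1  # one busy loop per CPU

def syscall_loop(_worker_id):
    while True:
        for root, _dirs, files in os.walk(TARGET_DIR):
            for name in files:
                path = os.path.join(root, name)
                try:
                    os.stat(path)                # stat() syscall
                    with open(path, "rb") as f:  # open()/read()/close() syscalls
                        f.read(4096)
                except OSError:
                    pass                         # ignore vanished/unreadable files

if __name__ == "__main__":
    with multiprocessing.Pool(WORKERS) as pool:
        pool.map(syscall_loop, range(WORKERS))   # runs until interrupted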

On Wed, 2025-05-07 at 17:23 +0200, Anna Fuchs via lustre-discuss wrote:
Dear all,

We're facing an issue that is hopefully not directly related to Lustre
itself (we're not using community Lustre), but maybe someone here has
seen something similar or knows someone who has.

On our GPU partition with A100-SXM4-80GB GPUs (VBIOS version:
92.00.36.00.02), we're trying to read IOPS statistics (osc_stats) via
the files under /sys/kernel/debug/lustre/osc/ (we're running 160 OSTs,
Lustre version 2.14.0_ddn184). Our goal is to sample the data at
5-second intervals, then aggregate and postprocess it into readable
metrics.
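
For reference, the sampling logic is roughly as follows (a simplified
Python sketch, not our actual collectd plugin; the exact counters and
aggregation we use differ):

#!/usr/bin/env python3
# Simplified sketch of the sampling loop: every INTERVAL seconds, read the
# per-OSC stats files under debugfs and sum the raw event counts per counter,
# then report per-interval deltas. Illustrative only.
import glob
import time

STATS_GLOB = "/sys/kernel/debug/lustre/osc/*/stats"
INTERVAL = 5  # seconds

def read_counters():
    totals = {}
    for path in glob.glob(STATS_GLOB):
        try:
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    # typical line: "<counter> <count> samples [unit] ..."
                    if len(fields) >= 3 and fields[2] == "samples":
                        totals[fields[0]] = totals.get(fields[0], 0) + int(fields[1])
        except OSError:
            continue  # an OSC may disappear between glob() and open()
    return totals

if __name__ == "__main__":
    prev = read_counters()
    while True:
        time.sleep(INTERVAL)
        cur = read_counters()
        for name, value in cur.items():
            delta = value - prev.get(name, 0)
            print(f"{name}: {delta / INTERVAL:.1f}/s")  # crude IOPS-like rate
        prev = cur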
We have a collectd daemon running, which had been stable for a long
time. After integrating the IOPS metric, however, we occasionally hit a
kernel panic (see crash dump excerpts below). The issue appears to
originate somewhere in the GPU firmware stack, but we're unsure why
this happens or how it relates to reading Lustre metrics.

The problem occurs often, but is hard to reproduce and happens at
random. We're hesitant to run the scripts frequently, since a crash
could interrupt critical GPU workloads. That said, limited test runs
over several hours often work fine, especially after a fresh reboot.
The CPU-only nodes run the same scripts all the time without issues.

Could this be a sign that /sys/kernel/debug is being overwhelmed
somehow? That shouldn't normally cause a kernel panic, though.

We'd appreciate any insights, experiences, or pointers, even indirect
ones.
Thanks in advance!
Anna

2024-12-17 17:11:28 [2453606.802826] NVRM: Xid (PCI:0000:03:00): 120, pid='<unknown>', name=<unknown>, GSP task timeout @ pc:0x4bd36c4, task:1
2024-12-17 17:11:28 [2453606.802835] NVRM:     Reported by libos task:0 v2.0 [0] @ ts:1734451888
2024-12-17 17:11:28 [2453606.802837] NVRM:     RISC-V CSR State:
2024-12-17 17:11:28 [2453606.802840] NVRM:     mstatus:0x000000001e000000  mscratch:0x0000000000000000  mie:0x0000000000000880  mip:0x0000000000000000
2024-12-17 17:11:28 [2453606.802842] NVRM:     mepc:0x0000000004bd36c4  mbadaddr:0x00000100badca700  mcause:0x8000000000000007
2024-12-17 17:11:28 [2453606.802844] NVRM:     RISC-V GPR State:
[...]
2024-12-17 17:11:29 [2453606.803121] NVRM: Xid (PCI:0000:03:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[...]
2024-12-17 17:30:03 [2454721.362906] Kernel panic - not syncing: Fatal exception
2024-12-17 17:30:03 [2454721.611822] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2024-12-17 17:30:03 [2454721.770927] ---[ end Kernel panic - not syncing: Fatal exception ]---



_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org