Dear all,

We're facing an issue that is hopefully not directly related to Lustre itself (we're not using community Lustre), but maybe someone here has seen something similar or knows someone who has.

On our GPU partition with A100-SXM4-80GB GPUs (VBIOS version: |92.00.36.00.02|), we’re trying to read IOPS statistics (osc_stats) via the files under |/sys/kernel/debug/lustre/osc/| (we’re running 160 OSTs, Lustre version |2.14.0_ddn184|). Our goal is to sample the data at 5-second intervals, then aggregate and postprocess it into readable metrics.
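
To make the access pattern concrete, here is a minimal sketch of this kind of sampling. It is not our actual collectd plugin. The per-device file name `stats`, the counter-line layout shown in the comment, and all function names are assumptions made for illustration only.

```python
import glob
import re
import time

# Assumed layout of one counter line (illustrative, not verified against
# this Lustre version):
#   read_bytes  42 samples [bytes] 4096 1048576 12345678
# i.e. name, sample count, the word "samples", unit, optional min/max/sum.
STAT_LINE = re.compile(r"^(\w+)\s+(\d+)\s+samples\s+\[(\w+)\]")


def parse_stats(text):
    """Return {counter_name: sample_count} for one stats file's content."""
    counts = {}
    for line in text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts


def read_all(pattern):
    """Sum the counters over every file matching the glob pattern."""
    total = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                parsed = parse_stats(f.read())
        except OSError:
            continue  # a device may vanish between glob and open
        for name, count in parsed.items():
            total[name] = total.get(name, 0) + count
    return total


def rates(prev, curr, interval):
    """Per-second rate of each counter between two snapshots."""
    return {
        name: (count - prev.get(name, 0)) / interval
        for name, count in curr.items()
        if count >= prev.get(name, 0)  # skip counters that were reset
    }


def sample_loop(pattern="/sys/kernel/debug/lustre/osc/*/stats", interval=5.0):
    """Print operation rates every `interval` seconds (file name assumed)."""
    prev = read_all(pattern)
    while True:
        time.sleep(interval)
        curr = read_all(pattern)
        print(rates(prev, curr, interval))
        prev = curr
```

With 160 OSTs, each pass of such a loop opens and reads 160 debugfs files, so a 5-second interval means roughly 32 file reads per second per node.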

We have a collectd daemon running, which had been stable for a long time. After integrating the IOPS metric, however, we occasionally hit a kernel panic (see crash dump excerpts below). The issue appears to originate somewhere in the GPU firmware stack, but we're unsure why this happens and how it's related to reading Lustre metrics.

The problem recurs regularly, but it strikes at random and is hard to reproduce on demand. We’re hesitant to run the scripts frequently, since a crash could interrupt critical GPU workloads. That said, limited test runs over several hours often work fine, especially after a fresh reboot. The CPU-only nodes run the same scripts all the time without issues.

Could this be a sign that |/sys/kernel/debug| is being overwhelmed somehow? Even so, that shouldn’t normally cause a kernel panic.

We’d appreciate any insights, experiences, or pointers, even indirect ones.

Thanks in advance!

Anna


2024-12-17 17:11:28 [2453606.802826] NVRM: Xid (PCI:0000:03:00): 120, pid='<unknown>', name=<unknown>, GSP task timeout @ pc:0x4bd36c4, task: 1
2024-12-17 17:11:28 [2453606.802835] NVRM: Reported by libos task:0 v2.0 [0] @ ts:1734451888
2024-12-17 17:11:28 [2453606.802837] NVRM: RISC-V CSR State:
2024-12-17 17:11:28 [2453606.802840] NVRM: mstatus:0x000000001e000000 mscratch:0x0000000000000000 mie:0x0000000000000880 mip:0x0000000000000000
2024-12-17 17:11:28 [2453606.802842] NVRM: mepc:0x0000000004bd36c4 mbadaddr:0x00000100badca700 mcause:0x8000000000000007
2024-12-17 17:11:28 [2453606.802844] NVRM: RISC-V GPR State:
[...]
2024-12-17 17:11:29 [2453606.803121] NVRM: Xid (PCI:0000:03:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[...]
2024-12-17 17:30:03 [2454721.362906] Kernel panic - not syncing: Fatal exception
2024-12-17 17:30:03 [2454721.611822] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2024-12-17 17:30:03 [2454721.770927] ---[ end Kernel panic - not syncing: Fatal exception ]---

-- 
Anna Fuchs
Universität Hamburg / Deutsches Klimarechenzentrum GmbH (DKRZ)

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
