Hello, I think this is an NVIDIA bug (GSP task).
Better contact NVIDIA support or the NVIDIA community.

Tahari.Abdeslam

On Wed, May 7, 2025, 18:20, Oleg Drokin via lustre-discuss <[email protected]> wrote:

> Hello!
>
> "An uncorrectable ECC error detected" does sound like there's some
> hardware problem. It is strange that you only get this on GPU nodes,
> though (extra power load leading to higher chances of memory
> corruption, plus more frequent kernel memory scanning increasing the
> chance of hitting such corruption?); I'd expect you'd be seeing other
> crashes on such GPU nodes as well.
>
> Can you generate some other CPU load (one that involves system calls)
> on those nodes and see if crashes suddenly go up as well, just in some
> other area? (A sketch of such a load generator appears after the
> quoted logs below.)
>
> On Wed, 2025-05-07 at 17:23 +0200, Anna Fuchs via lustre-discuss wrote:
> >
> > Dear all,
> >
> > We're facing an issue that is hopefully not directly related to
> > Lustre itself (we're not using community Lustre), but maybe someone
> > here has seen something similar or knows someone who has.
> >
> > On our GPU partition with A100-SXM4-80GB GPUs (VBIOS version
> > 92.00.36.00.02), we're trying to read IOPS statistics (osc_stats)
> > via the files under /sys/kernel/debug/lustre/osc/ (we're running
> > 160 OSTs, Lustre version 2.14.0_ddn184). Our goal is to sample the
> > data at 5-second intervals, then aggregate and postprocess it into
> > readable metrics. (A sketch of such a sampler appears after the
> > quoted logs below.)
> >
> > We have a collectd daemon running, which had been stable for a long
> > time. After integrating the IOPS metric, however, we occasionally
> > hit a kernel panic (see crash dump excerpts below). The issue
> > appears to originate somewhere in the GPU firmware stack, but we're
> > unsure why this happens and how it's related to reading Lustre
> > metrics.
> >
> > The problem occurs often, but is hard to reproduce and happens at
> > random. We're hesitant to run the scripts frequently, since a crash
> > could interrupt critical GPU workloads. That said, limited test
> > runs over several hours often work fine, especially after a fresh
> > reboot. The CPU-only nodes run the same scripts without issues all
> > the time.
> >
> > Could this be a sign that /sys/kernel/debug is being overwhelmed
> > somehow? Although that shouldn't normally cause a kernel panic.
> >
> > We'd appreciate any insights, experiences, or pointers, even
> > indirect ones.
> >
> > Thanks in advance!
> >
> > Anna
> >
> >
> > 2024-12-17 17:11:28 [2453606.802826] NVRM: Xid (PCI:0000:03:00): 120,
> > pid='<unknown>', name=<unknown>, GSP task timeout @ pc:0x4bd36c4,
> > task: 1
> > 2024-12-17 17:11:28 [2453606.802835] NVRM: Reported by libos
> > task:0 v2.0 [0] @ ts:1734451888
> > 2024-12-17 17:11:28 [2453606.802837] NVRM: RISC-V CSR State:
> > 2024-12-17 17:11:28 [2453606.802840] NVRM:
> > mstatus:0x000000001e000000 mscratch:0x0000000000000000
> > mie:0x0000000000000880 mip:0x0000000000000000
> > 2024-12-17 17:11:28 [2453606.802842] NVRM:
> > mepc:0x0000000004bd36c4 mbadaddr:0x00000100badca700
> > mcause:0x8000000000000007
> > 2024-12-17 17:11:28 [2453606.802844] NVRM: RISC-V GPR State:
> > [...]
> > 2024-12-17 17:11:29 [2453606.803121] NVRM: Xid (PCI:0000:03:00): 140,
> > pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected
> > (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0,
> > PCIE:0
> > [...]
> > 2024-12-17 17:30:03 [2454721.362906] Kernel panic - not syncing:
> > Fatal exception
> > 2024-12-17 17:30:03 [2454721.611822] Kernel Offset: 0x5200000 from
> > 0xffffffff81000000 (relocation range: 0xffffffff80000000-
> > 0xffffffffbfffffff)
> > 2024-12-17 17:30:03 [2454721.770927] ---[ end Kernel panic - not
> > syncing: Fatal exception ]---
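
For reference, here is a minimal sketch (in Python) of a 5-second debugfs sampler along the lines Anna describes. The glob pattern and the per-target "stats" file name are assumptions based on a typical Lustre 2.x debugfs layout, not her actual collectd plugin; adjust both to your site:

#!/usr/bin/env python3
"""Minimal debugfs sampler sketch: read every per-OSC stats file at a
fixed interval. PATTERN below is an assumed layout, not the exact one
from the original report."""
import glob
import time

PATTERN = "/sys/kernel/debug/lustre/osc/*/stats"  # assumed layout
INTERVAL = 5.0  # seconds, matching the sampling rate described above

def sample():
    snapshot = {}
    for path in glob.glob(PATTERN):
        try:
            # debugfs is typically mode 0700, so this needs root.
            with open(path) as f:
                snapshot[path] = f.read()
        except OSError:
            # A target can disappear between glob() and open().
            continue
    return snapshot

if __name__ == "__main__":
    while True:
        t0 = time.time()
        snap = sample()
        print(time.strftime("%Y-%m-%d %H:%M:%S"),
              "sampled", len(snap), "OSC stats files")
        # Sleep for the remainder of the interval so samples stay 5 s apart.
        time.sleep(max(0.0, INTERVAL - (time.time() - t0)))

With 160 OSTs this opens and reads a few hundred small files per cycle, which by itself should be harmless; the value of a standalone sampler like this is that it can reproduce the crash outside collectd.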
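
And a minimal sketch of the syscall-heavy CPU load Oleg suggests generating; the target directory and worker count are arbitrary illustrative choices, and any stat()-heavy loop would serve:

#!/usr/bin/env python3
"""Syscall-heavy load generator: keep all CPUs busy crossing the
user/kernel boundary and watch whether crash frequency rises.
TARGET and WORKERS are illustrative choices."""
import multiprocessing
import os

TARGET = "/usr"                 # any directory tree with many entries
WORKERS = os.cpu_count() or 4

def hammer_syscalls(_):
    # Each pass issues many system calls: getdents via os.walk,
    # one stat() per file, plus open/read/close on procfs.
    while True:
        for root, dirs, files in os.walk(TARGET):
            for name in files:
                try:
                    os.stat(os.path.join(root, name))
                except OSError:
                    pass
        with open("/proc/self/stat") as f:
            f.read()

if __name__ == "__main__":
    # Runs until interrupted (Ctrl-C); each worker loops forever.
    with multiprocessing.Pool(WORKERS) as pool:
        pool.map(hammer_syscalls, range(WORKERS))

If crashes also rise under this load, that points away from the Lustre debugfs reads specifically and toward general syscall/interrupt load on the GPU nodes, as Oleg hypothesizes.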
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
