On Wed, Aug 06, 2025 at 05:51:29PM -0500, Corey Minyard wrote: > On Wed, Aug 06, 2025 at 04:36:41PM -0500, Frederick Lawler wrote: > > On Wed, Aug 06, 2025 at 04:16:18PM -0500, Corey Minyard wrote: > > > On Wed, Aug 06, 2025 at 03:19:02PM -0500, Fred Lawler wrote: > > > > + CC: Corey Minyard <co...@minyard.net> > > > > > > > > > I'm wondering if something is happening with the BMC resetting and > > > interactions with ACPI involved in that. Adding the extra part of > > > trying to talk to the BMC while it's being reset could cause the BMC to > > > get confused and do bad things? > > > > > > > Sure, it's a possibility we explored. We have a lot of automation. > > Predominately of which is a prometheus module exporting IPMI information > > from the sysfs files. And we also have config management that's querying > > sysfs files to regulate updates etc... Sometimes, the config management > > automation will attempt to reset the BMC. > > Ok. I have tests that do BMC resets, but I can't run at the scale you > do, and I'm running in a simulator so it's not going to be have the > same. > > The other possibility is the processor goes into the idle code while > interrupts are off, but I think the kernel has checks all around that. > I can't think of how else a processor would get stuck in idle. >
Yes, it's a bit of an odd case. There's nothing obvious reported by the crash utility. By the time we get the NMI/panic, the CPUs are off doing something else in our crash typical case. That said, earlier this week I got a hard lockup outside of a BMC reset, but the node had too many MCE correctable memory errors. For sake of completeness, I'll post that stack trace here anyway since that may provide some more context clues. In this case, I did catch two separate reads to sysfs files, and then they appear to have competed. The cat process seemed to already be off CPU, but the KCS message is still coming in at the same time the python script was being processed too. Only the python run was on CPU at time of crash. But NMI panic was still on a idle CPU. Unfortunately, I didn't write down all the logs this one, so it's missing the idle state NMI for watchdog, but hopefully the snippets show what's happening. I posted this below. > > > > > > > > > > > > I tried also tried to load the CPUs with stress-ng, but the best I > > > > > can do > > > > > are the hung tasks. > > > > > > > > > > I identified that sni_send()[1] could be locked behind the > > > > > spin_lock_irqsave() and within the KCS send handler, there's another > > > > > irq > > > > > save lock. I suspect this is where we're getting hung up. Below is a > > > > > sample stack trace + log output. > > > > > > Yeah, I don't see that in the traceback. There is a lock in the KCS > > > sender, but I don't see how that could do anything. > > > > > > Maybe you could try changing the cpuidle handler? That would be at > > > least something to try. > > > > > > > Would that help in forming a reproducer? I'd need to deploy any kernel > > modifications fleet wide to cast a wide enough net. The lockups arn't > > extremely consistent. We may get a couple or more a week. > > Ah, so this isn't readily reproducable. Bummer. > > If the problem goes away if you change the cpuidle handler to something > non-ACPI, that would be a big clue that it's an ACPI issue. > > > > > Lastly, I have the rate limit patch backported. I'll be able to start > > testing with that tomorrow, and same with loading the IPMI watchdog > > module. > > Ok. I don't have much hope for it making much difference, but it's safe > and will be coming in the next kernel release. > It occurred to me last night that I'd probably like a rate limit on the KCS messages as well. I didn't see if a patch for that was made. I can whip that up sometime next week, that could be of use to anyone. [1533534.869508] [Hardware Error]: Corrected error, no action required. [1533534.884635] [Hardware Error]: CPU:1 (17:31:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b [1533534.912122] [Hardware Error]: Error Addr: 0x0000000313c7a020 [1533534.926641] [Hardware Error]: IPID: 0x0000009600350f00, Syndrome: 0x9fec08000a800a01 [1533534.943278] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0 [1533534.946635] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#1channel#3 (csrow:1 channel:3 page:0x0 offset:0x0 grain:64 syndrome:0x800) [1533535.369487] INFO: task cat:1844873 blocked for more than 10 seconds. [1533535.385145] Tainted: G W O 6.12.35-cloudflare-2025.6.15 #1 [1533535.401614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [1533535.418715] task:cat state:D stack:0 pid:1844873 tgid:1844873 ppid:1844872 task_flags:0x400000 flags:0x00004002 [1533535.447475] Call Trace: [1533535.458691] <TASK> [1533535.469154] __schedule+0x4fa/0xbf0 [1533535.481433] schedule+0x27/0xf0 [1533535.493181] __get_guid+0xf4/0x130 [ipmi_msghandler] [1533535.506325] ? __pfx_autoremove_wake_function+0x10/0x10 [1533535.519910] __bmc_get_device_id+0xd6/0xa30 [ipmi_msghandler] [1533535.534459] ? srso_return_thunk+0x5/0x5f [1533535.546509] ? srso_return_thunk+0x5/0x5f [1533535.558540] ? __memcg_slab_post_alloc_hook+0x21b/0x410 [1533535.571722] aux_firmware_rev_show+0x38/0x90 [ipmi_msghandler] [1533535.585304] ? __kmalloc_node_noprof+0x3f6/0x450 [1533535.598144] ? seq_read_iter+0x376/0x460 [1533535.609621] dev_attr_show+0x1c/0x40 [1533535.621024] sysfs_kf_seq_show+0x8f/0xe0 [1533535.632316] seq_read_iter+0x11f/0x460 [1533535.643172] ? security_file_permission+0x9/0xb0 [1533535.655102] vfs_read+0x260/0x330 [1533535.665368] ksys_read+0x65/0xe0 [1533535.675559] do_syscall_64+0x4b/0x110 [1533535.686324] entry_SYSCALL_64_after_hwframe+0x76/0x7e [1533535.698530] RIP: 0033:0x7f72b587125d [1533535.708857] RSP: 002b:00007ffccc21bb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [1533535.723411] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f72b587125d [1533535.737361] RDX: 0000000000020000 RSI: 00007f72b5755000 RDI: 0000000000000003 [1533535.751191] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000 [1533535.764847] R10: 00007f72b5788b60 R11: 0000000000000246 R12: 00007f72b5755000 [1533535.778536] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000000 [1533535.792210] </TASK> crash> bt -l 1781073 PID: 1781073 TASK: ffff9d91c7040000 CPU: 81 COMMAND: "/usr/bin/python" #0 [ffffb3a171683c00] __schedule at ffffffff9d559eea /cfsetup_build/build/linux/kernel/sched/core.c: 5338 #1 [ffffb3a171683c80] schedule at ffffffff9d55a617 /cfsetup_build/build/linux/arch/x86/include/asm/preempt.h: 84 #2 [ffffb3a171683c90] __get_guid at ffffffffc22aa574 [ipmi_msghandler] #3 [ffffb3a171683ce8] __bmc_get_device_id at ffffffffc22aa696 [ipmi_msghandler] #4 [ffffb3a171683da0] aux_firmware_rev_show at ffffffffc22ab1c8 [ipmi_msghandler] #5 [ffffb3a171683dd0] dev_attr_show at ffffffff9d1175dc /cfsetup_build/build/linux/drivers/base/core.c: 2425 #6 [ffffb3a171683de8] sysfs_kf_seq_show at ffffffff9cc64caf /cfsetup_build/build/linux/fs/sysfs/file.c: 60 #7 [ffffb3a171683e10] seq_read_iter at ffffffff9cbddf7f /cfsetup_build/build/linux/fs/seq_file.c: 230 #8 [ffffb3a171683e68] vfs_read at ffffffff9cba8590 /cfsetup_build/build/linux/fs/read_write.c: 489 #9 [ffffb3a171683f00] ksys_read at ffffffff9cba9165 /cfsetup_build/build/linux/fs/read_write.c: 713 #10 [ffffb3a171683f38] do_syscall_64 at ffffffff9d550c8b /cfsetup_build/build/linux/arch/x86/entry/common.c: 52 #11 [ffffb3a171683f50] entry_SYSCALL_64_after_hwframe at ffffffff9d60012f /cfsetup_build/build/linux/arch/x86/entry/entry_64.S: 130 RIP: 00007f04e1b7c29c RSP: 00007ffea7aaf6c0 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000a840f8 RCX: 00007f04e1b7c29c RDX: 0000000000001001 RSI: 000000002fd06ef0 RDI: 00000000000000c1 RBP: 00007f04e1a82fc0 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000001001 R13: 000000002fd06ef0 R14: 00000000000000c1 R15: 0000000000a41520 ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b crash> files 1781073 ... 193 ffff9db5132e5800 ffff9dafb18bd200 ffff9da7b780bcf0 REG /sys/devices/platform/ipmi_bmc.0/aux_firmware_revision crash> log -c ... [1533553.998160] [ C7] ipmi_si IPI0001:00: KCS in invalid state 6 [1533554.009156] [ C7] ipmi_si IPI0001:00: KCS in invalid state 8 [1533554.019973] [T1844873] ipmi_si IPI0001:00: KCS in invalid state 9 [1533554.031005] [ C81] ipmi_si IPI0001:00: IPMI message handler: device id fetch failed: 0xd5 _______________________________________________ Openipmi-developer mailing list Openipmi-developer@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/openipmi-developer