Hi Corey I crashed a machine on 1st August after issuing 'ipmitool mc reset cold' to reset a BMC. I got a crash dump from this event which I have been analyzing. The crash occurred when the NMI watchdog detected a hard LOCKUP in an interrupt handler:
[144482.968722] CPU: 1 PID: 96220 Comm: process-finder Kdump: loaded Tainted: G W O 6.6.93-1.el8.x86_64 #1 [144482.968724] RIP: 0010:port_outb+0x13/0x20 [ipmi_si] [144482.968735] Code: 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 66 0f af 77 18 89 d0 0f b7 57 28 01 f2 ee <c3> cc cc cc cc 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 [144482.968736] RSP: 0018:ff626798c007ce50 EFLAGS: 00000002 [144482.968737] RAX: 0000000000000000 RBX: ff2e8eaa120b1c00 RCX: ff2e8ee87e860640 [144482.968738] RDX: 0000000000000ca2 RSI: 0000000000000000 RDI: ff2e8ee98e8c0840 [144482.968738] RBP: 0000000000000001 R08: ff2e8ee87e860668 R09: ff626798c007cf08 [144482.968739] R10: 0000000000000006 R11: 000000000000044d R12: 0000000000000000 [144482.968739] R13: ff2e8ee98e8c0800 R14: ffffffffc27ad210 R15: ff626798c007cf00 [144482.968740] FS: 00007fffe8bff700(0000) GS:ff2e8ee87e840000(0000) knlGS:0000000000000000 [144482.968740] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [144482.968741] CR2: 00007ffff7ceb528 CR3: 000000047de9e001 CR4: 0000000000771ee0 [144482.968742] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [144482.968742] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [144482.968743] PKRU: 55555554 [144482.968743] Call Trace: [144482.968745] <IRQ> [144482.968746] kcs_event+0x253/0x960 [ipmi_si] [144482.968751] smi_event_handler+0x5b/0x280 [ipmi_si] [144482.968756] smi_timeout+0x3b/0xc0 [ipmi_si] [144482.968760] ? __pfx_smi_timeout+0x10/0x10 [ipmi_si] [144482.968764] call_timer_fn+0x24/0x130 [144482.968769] __run_timers.part.0+0x1d8/0x280 [144482.968771] ? enqueue_hrtimer+0x35/0x90 [144482.968772] ? __hrtimer_run_queues+0x141/0x2b0 [144482.968772] ? sched_clock+0xc/0x30 [144482.968775] run_timer_softirq+0x26/0x50 [144482.968776] handle_softirqs+0xdd/0x2d0 [144482.968779] irq_exit_rcu+0xa8/0xd0 [144482.968781] sysvec_apic_timer_interrupt+0x6e/0x90 [144482.968784] </IRQ> I was able to reproduce the crash two days ago (12th August) by running 'ipmitool mc reset cold' in a loop with 2 minute sleeps between on identical test hardware running the same kernel version, although so far when I have reproduced the crash I have not been able to get another crash dump. # c=0; while :; do ((c+=1)); echo $(date) - $c; ipmitool mc reset cold; sleep 120; done Tue 12 Aug 07:02:28 EDT 2025 - 1 Sent cold reset command to MC Tue 12 Aug 07:04:28 EDT 2025 - 2 Sent cold reset command to MC Tue 12 Aug 07:06:28 EDT 2025 - 3 Sent cold reset command to MC Tue 12 Aug 07:08:28 EDT 2025 - 4 Sent cold reset command to MC Tue 12 Aug 07:10:28 EDT 2025 - 5 Sent cold reset command to MC Tue 12 Aug 07:12:28 EDT 2025 - 6 Sent cold reset command to MC Tue 12 Aug 07:14:28 EDT 2025 - 7 Sent cold reset command to MC Tue 12 Aug 07:16:28 EDT 2025 - 8 Sent cold reset command to MC Tue 12 Aug 07:18:28 EDT 2025 - 9 Sent cold reset command to MC Tue 12 Aug 07:20:28 EDT 2025 - 10 Sent cold reset command to MC Tue 12 Aug 07:22:28 EDT 2025 - 11 Sent cold reset command to MC Tue 12 Aug 07:24:28 EDT 2025 - 12 Sent cold reset command to MC Tue 12 Aug 07:26:28 EDT 2025 - 13 Sent cold reset command to MC EXIT STATUS 255 I have tried (and so far failed) to reproduce the problem on kernel 6.1.144-1.el8.x86_64, but admittedly I haven't tried very hard yet so that might not be a reliable data point. On the reproducer, I was gathering debug data from the ipmi_si module using 'echo 7 > /sys/module/ipmi_si/parameters/kcs_debug' and was running 'journalctl -f' in a terminal window at the time of the crash, where the terminal buffer is filled up with thousands of lines like this, which were produced as the BMC was resetting: Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: ipmi_kcs_sm: kcs hosed: Not in read state for error2 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: ipmi_kcs_sm: kcs hosed: Not in read state for error2 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: ipmi_kcs_sm: kcs hosed: Not in read state for error2 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: ipmi_kcs_sm: kcs hosed: Not in read state for error2 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1 Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1 I collected some more debug data from the vmcore file collected on 1st August: $ crash --zero_excluded /usr/lib/debug/lib/modules/6.6.93-1.el8.x86_64/vmlinux vmcore ... crash> mod -s ipmi_si MODULE NAME TEXT_BASE SIZE OBJECT FILE ffffffffc27dde80 ipmi_si ffffffffc27ab000 86016 /usr/lib/debug/lib/modules/6.6.93-1.el8.x86_64/kernel/drivers/char/ipmi/ipmi_si.ko.debug crash> struct smi_info 0xff2e8ee98e8c0800 struct smi_info { si_num = 0, intf = 0xff2e8ee98fbaa000, si_sm = 0xff2e8eaa120b1c00, handlers = 0xffffffffc27e4240 <kcs_smi_handlers>, si_lock = { { rlock = { raw_lock = { { val = { counter = 257 }, { locked = 1 '\001', pending = 1 '\001' }, { locked_pending = 257, tail = 0 } } } } } }, waiting_msg = 0x0, curr_msg = 0x0, si_state = SI_NORMAL, io = { inputb = 0xffffffffc27b1940 <port_inb>, outputb = 0xffffffffc27b1970 <port_outb>, addr = 0x0, regspacing = 1, regsize = 1, regshift = 0, addr_space = IPMI_IO_ADDR_SPACE, addr_data = 3234, addr_source = SI_ACPI, addr_info = { acpi_info = { acpi_handle = 0xff2e8ee9891e2f30 } }, io_setup = 0xffffffffc27b1ac0 <ipmi_si_port_setup>, io_cleanup = 0xffffffffc27b1a60 <port_cleanup>, io_size = 2, irq = 0, irq_setup = 0x0, irq_handler_data = 0x0, irq_cleanup = 0x0, slave_addr = 32 ' ', si_type = SI_KCS, dev = 0xff2e8ee98ac6c010 }, oem_data_avail_handler = 0x0, msg_flags = 0 '\000', has_event_buffer = false, req_events = { counter = 0 }, run_to_completion = false, si_timer = { entry = { next = 0xdead000000000122, pprev = 0x0 }, expires = 4439136548, function = 0xffffffffc27ad210 <smi_timeout>, flags = 155189249 }, timer_can_start = true, timer_running = true, last_timeout_jiffies = 4439136547, need_watch = { counter = 0 }, interrupt_disabled = true, supports_event_msg_buff = false, cannot_disable_irq = false, irq_enable_broken = false, in_maintenance_mode = true, got_attn = false, device_id = { device_id = 32 ' ', device_revision = 2 '\002', firmware_revision_1 = 1 '\001', firmware_revision_2 = 0 '\000', ipmi_version = 2 '\002', additional_device_support = 191 '\277', manufacturer_id = 10876, product_id = 7496, aux_firmware_revision = "!\001\000 ", aux_firmware_revision_set = 1 }, dev_group_added = true, stats = {{ counter = 13470 }, { counter = 1809 }, { counter = 358202 }, { counter = 0 }, { counter = 0 }, { counter = 0 }, { counter = 24503 }, { counter = 357924 }, { counter = 0 }, { counter = 0 }, { counter = 0 }}, thread = 0xff2e8eaa82124100, link = { next = 0xffffffffc27dd780 <smi_infos>, prev = 0xffffffffc27dd780 <smi_infos> } } crash> struct si_sm_data 0xff2e8eaa120b1c00 struct si_sm_data { state = KCS_ERROR1, io = 0xff2e8ee98e8c0840, write_data = "\030\001\003\001\000\000&\030@\000\000\000\000\000\000\000\330\002\000\000\000\000\000\000\330\002\000\000\000\000\000\000\b\000\000\000\000\000\000\000\003\000\000\000\004\000\000\000\030\003\000\000\000\000\000\000\030\003\000\000\000\000\000\000\030\003\000\000\000\000\000\000\034\000\000\000\000\000\000\000\034\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\004\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\n\000\000\000\000\000\000\000\n\000\000\000\000\000\000\000\020\000\000\000\000\000\000\001\000\000\000\005\000\000\000\000\020\000\000\000\000\000\000\000\020\000\000\000\000\000\000\000\020\000\000\000\000\000\000E\005\000\000\000\000\000\000E\005\000\000\000\000\000\000\000\020\000\000\000\000\000\000\001\000\000\000\004\000\000\000\000 \000\000\000\000\000\000\000 \000\000\000\000\000\000\000 \000\000\000\000\000\000\000\003\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000"..., write_pos = 0, write_count = 0, orig_write_count = 0, read_data = "\034\002\000 \002\001\000\002\277|*\000H\035!\001\000 \034\000@SDA Temp\000\a-C\374\177\200KF\000\000\006\000\000\000@ -\000\000\000\000\000\000@=\000\000\000\000\000\000@=\000\000\000\000\000\000 \002\000\000\000\000\000\000 \002\000\000\000\000\000\000\b\000\000\000\000\000\000\000\004\000\000\000\004\000\000\000\070\003\000\000\000\000\000\000\070\003\000\000\000\000\000\000\070\003\000\000\000\000\000\000 \000\000\000\000\000\000\000 \000\000\000\000\000\000\000\b\000\000\000\000\000\000\000\004\000\000\000\004\000\000\000X\003\000\000\000\000\000\000X\003\000\000\000\000\000\000X\003\000\000\000\000\000\000D\000\000\000\000\000\000\000D\000\000\000\000\000\000\000\004\000\000\000\000\000\000\000S\345td\004\000\000\000pz\350\320\023u\023\376\070\003\000\000\000\000\000\000\070\003\000\000\000\000\000\000 \000\000\000\000\000\000\000 \000\000\000\000\000\000\000\b\000\000\000\000\000\000\000"..., read_pos = 0, truncated = 0, error_retries = 6, ibf_timeout = 5000000, obf_timeout = 5000000, error0_timeout = 4439151592 } crash> >From the above it looks like, at the time of the crash, the state machine was at KCS_ERROR1 (si_sm_data.state) having at that moment in time handled 6 retries (si_sm_data.error_retries), but having a hosed counter of 24,503 (smi_info.stats[6]). Looking in the smi_event_handler code, I wasn't immediately sure whether a result of SI_SM_HOSED would cause the interrupt handler to keep looping around and not allow other interrupts to fire, but the symptoms might suggest that? Although if that was the case I'm surprised we haven't seen the problem more often, we have lots of machines. My presumption was that this: [144482.968724] RIP: 0010:port_outb+0x13/0x20 [ipmi_si] [144482.968735] Code: 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 66 0f af 77 18 89 d0 0f b7 57 28 01 f2 ee <c3> cc cc cc cc 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 ... as well as the thousands of state transitions I saw when reproducing the problem with debug output, and the hosed counter being very high in the vmcore, suggested that it wasn't actually stuck at a ret instruction (c3) in port_outb, but that's just where RIP was at the point the crash was taken. Happy to collect more info from the vmcore as needed or test patches etc. Best regards Mark
_______________________________________________ Openipmi-developer mailing list Openipmi-developer@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/openipmi-developer