Hi Corey

I crashed a machine on 1st August after issuing 'ipmitool mc reset cold' to
reset a BMC.  I got a crash dump from this event which I have been
analyzing.  The crash occurred when the NMI watchdog detected a hard LOCKUP
in an interrupt handler:

[144482.968722] CPU: 1 PID: 96220 Comm: process-finder Kdump: loaded
Tainted: G        W  O       6.6.93-1.el8.x86_64 #1
[144482.968724] RIP: 0010:port_outb+0x13/0x20 [ipmi_si]
[144482.968735] Code: 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 0f 1f 44 00 00 66 0f af 77 18 89 d0 0f b7 57 28 01 f2 ee
<c3> cc cc cc cc 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
[144482.968736] RSP: 0018:ff626798c007ce50 EFLAGS: 00000002
[144482.968737] RAX: 0000000000000000 RBX: ff2e8eaa120b1c00 RCX:
ff2e8ee87e860640
[144482.968738] RDX: 0000000000000ca2 RSI: 0000000000000000 RDI:
ff2e8ee98e8c0840
[144482.968738] RBP: 0000000000000001 R08: ff2e8ee87e860668 R09:
ff626798c007cf08
[144482.968739] R10: 0000000000000006 R11: 000000000000044d R12:
0000000000000000
[144482.968739] R13: ff2e8ee98e8c0800 R14: ffffffffc27ad210 R15:
ff626798c007cf00
[144482.968740] FS:  00007fffe8bff700(0000) GS:ff2e8ee87e840000(0000)
knlGS:0000000000000000
[144482.968740] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[144482.968741] CR2: 00007ffff7ceb528 CR3: 000000047de9e001 CR4:
0000000000771ee0
[144482.968742] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[144482.968742] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[144482.968743] PKRU: 55555554
[144482.968743] Call Trace:
[144482.968745]  <IRQ>
[144482.968746]  kcs_event+0x253/0x960 [ipmi_si]
[144482.968751]  smi_event_handler+0x5b/0x280 [ipmi_si]
[144482.968756]  smi_timeout+0x3b/0xc0 [ipmi_si]
[144482.968760]  ? __pfx_smi_timeout+0x10/0x10 [ipmi_si]
[144482.968764]  call_timer_fn+0x24/0x130
[144482.968769]  __run_timers.part.0+0x1d8/0x280
[144482.968771]  ? enqueue_hrtimer+0x35/0x90
[144482.968772]  ? __hrtimer_run_queues+0x141/0x2b0
[144482.968772]  ? sched_clock+0xc/0x30
[144482.968775]  run_timer_softirq+0x26/0x50
[144482.968776]  handle_softirqs+0xdd/0x2d0
[144482.968779]  irq_exit_rcu+0xa8/0xd0
[144482.968781]  sysvec_apic_timer_interrupt+0x6e/0x90
[144482.968784]  </IRQ>

I was able to reproduce the crash two days ago (12th August) by running
'ipmitool mc reset cold' in a loop with 2 minute sleeps between on
identical test hardware running the same kernel version, although so far
when I have reproduced the crash I have not been able to get another crash
dump.

# c=0; while :; do ((c+=1)); echo $(date) - $c; ipmitool mc reset cold;
sleep 120; done
Tue 12 Aug 07:02:28 EDT 2025 - 1
Sent cold reset command to MC
Tue 12 Aug 07:04:28 EDT 2025 - 2
Sent cold reset command to MC
Tue 12 Aug 07:06:28 EDT 2025 - 3
Sent cold reset command to MC
Tue 12 Aug 07:08:28 EDT 2025 - 4
Sent cold reset command to MC
Tue 12 Aug 07:10:28 EDT 2025 - 5
Sent cold reset command to MC
Tue 12 Aug 07:12:28 EDT 2025 - 6
Sent cold reset command to MC
Tue 12 Aug 07:14:28 EDT 2025 - 7
Sent cold reset command to MC
Tue 12 Aug 07:16:28 EDT 2025 - 8
Sent cold reset command to MC
Tue 12 Aug 07:18:28 EDT 2025 - 9
Sent cold reset command to MC
Tue 12 Aug 07:20:28 EDT 2025 - 10
Sent cold reset command to MC
Tue 12 Aug 07:22:28 EDT 2025 - 11
Sent cold reset command to MC
Tue 12 Aug 07:24:28 EDT 2025 - 12
Sent cold reset command to MC
Tue 12 Aug 07:26:28 EDT 2025 - 13
Sent cold reset command to MC
EXIT STATUS 255

I have tried (and so far failed) to reproduce the problem on
kernel 6.1.144-1.el8.x86_64, but admittedly I haven't tried very hard yet
so that might not be a reliable data point.

On the reproducer, I was gathering debug data from the ipmi_si module using
'echo 7 > /sys/module/ipmi_si/parameters/kcs_debug' and was running
'journalctl -f' in a terminal window at the time of the crash, where the
terminal buffer is filled up with thousands of lines like this, which were
produced as the BMC was resetting:

Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: ipmi_kcs_sm: kcs hosed: Not in
read state for error2
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: ipmi_kcs_sm: kcs hosed: Not in
read state for error2
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: ipmi_kcs_sm: kcs hosed: Not in
read state for error2
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: ipmi_kcs_sm: kcs hosed: Not in
read state for error2
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 7, c9
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 8, c1
Aug 12 07:27:44 kernel: ipmi_si IPI0001:00: KCS: State = 6, c1

I collected some more debug data from the vmcore file collected on 1st
August:

$ crash --zero_excluded
/usr/lib/debug/lib/modules/6.6.93-1.el8.x86_64/vmlinux vmcore
...
crash> mod -s ipmi_si
     MODULE       NAME                              TEXT_BASE         SIZE
 OBJECT FILE
ffffffffc27dde80  ipmi_si                        ffffffffc27ab000    86016
 
/usr/lib/debug/lib/modules/6.6.93-1.el8.x86_64/kernel/drivers/char/ipmi/ipmi_si.ko.debug


crash> struct smi_info 0xff2e8ee98e8c0800
struct smi_info {
  si_num = 0,
  intf = 0xff2e8ee98fbaa000,
  si_sm = 0xff2e8eaa120b1c00,
  handlers = 0xffffffffc27e4240 <kcs_smi_handlers>,
  si_lock = {
    {
      rlock = {
        raw_lock = {
          {
            val = {
              counter = 257
            },
            {
              locked = 1 '\001',
              pending = 1 '\001'
            },
            {
              locked_pending = 257,
              tail = 0
            }
          }
        }
      }
    }
  },
  waiting_msg = 0x0,
  curr_msg = 0x0,
  si_state = SI_NORMAL,
  io = {
    inputb = 0xffffffffc27b1940 <port_inb>,
    outputb = 0xffffffffc27b1970 <port_outb>,
    addr = 0x0,
    regspacing = 1,
    regsize = 1,
    regshift = 0,
    addr_space = IPMI_IO_ADDR_SPACE,
    addr_data = 3234,
    addr_source = SI_ACPI,
    addr_info = {
      acpi_info = {
        acpi_handle = 0xff2e8ee9891e2f30
      }
    },
    io_setup = 0xffffffffc27b1ac0 <ipmi_si_port_setup>,
    io_cleanup = 0xffffffffc27b1a60 <port_cleanup>,
    io_size = 2,
    irq = 0,
    irq_setup = 0x0,
    irq_handler_data = 0x0,
    irq_cleanup = 0x0,
    slave_addr = 32 ' ',
    si_type = SI_KCS,
    dev = 0xff2e8ee98ac6c010
  },
  oem_data_avail_handler = 0x0,
  msg_flags = 0 '\000',
  has_event_buffer = false,
  req_events = {
    counter = 0
  },
  run_to_completion = false,
  si_timer = {
    entry = {
      next = 0xdead000000000122,
      pprev = 0x0
    },
    expires = 4439136548,
    function = 0xffffffffc27ad210 <smi_timeout>,
    flags = 155189249
  },
  timer_can_start = true,
  timer_running = true,
  last_timeout_jiffies = 4439136547,
  need_watch = {
    counter = 0
  },
  interrupt_disabled = true,
  supports_event_msg_buff = false,
  cannot_disable_irq = false,
  irq_enable_broken = false,
  in_maintenance_mode = true,
  got_attn = false,
  device_id = {
    device_id = 32 ' ',
    device_revision = 2 '\002',
    firmware_revision_1 = 1 '\001',
    firmware_revision_2 = 0 '\000',
    ipmi_version = 2 '\002',
    additional_device_support = 191 '\277',
    manufacturer_id = 10876,
    product_id = 7496,
    aux_firmware_revision = "!\001\000 ",
    aux_firmware_revision_set = 1
  },
  dev_group_added = true,
  stats = {{
      counter = 13470
    }, {
      counter = 1809
    }, {
      counter = 358202
    }, {
      counter = 0
    }, {
      counter = 0
    }, {
      counter = 0
    }, {
      counter = 24503
    }, {
      counter = 357924
    }, {
      counter = 0
    }, {
      counter = 0
    }, {
      counter = 0
    }},
  thread = 0xff2e8eaa82124100,
  link = {
    next = 0xffffffffc27dd780 <smi_infos>,
    prev = 0xffffffffc27dd780 <smi_infos>
  }
}

crash> struct si_sm_data 0xff2e8eaa120b1c00
struct si_sm_data {
  state = KCS_ERROR1,
  io = 0xff2e8ee98e8c0840,
  write_data = 
"\030\001\003\001\000\000&\030@\000\000\000\000\000\000\000\330\002\000\000\000\000\000\000\330\002\000\000\000\000\000\000\b\000\000\000\000\000\000\000\003\000\000\000\004\000\000\000\030\003\000\000\000\000\000\000\030\003\000\000\000\000\000\000\030\003\000\000\000\000\000\000\034\000\000\000\000\000\000\000\034\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\004\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\n\000\000\000\000\000\000\000\n\000\000\000\000\000\000\000\020\000\000\000\000\000\000\001\000\000\000\005\000\000\000\000\020\000\000\000\000\000\000\000\020\000\000\000\000\000\000\000\020\000\000\000\000\000\000E\005\000\000\000\000\000\000E\005\000\000\000\000\000\000\000\020\000\000\000\000\000\000\001\000\000\000\004\000\000\000\000
\000\000\000\000\000\000\000 \000\000\000\000\000\000\000
\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000"...,
  write_pos = 0,
  write_count = 0,
  orig_write_count = 0,
  read_data = "\034\002\000 \002\001\000\002\277|*\000H\035!\001\000
\034\000@SDA Temp\000\a-C\374\177\200KF\000\000\006\000\000\000@
-\000\000\000\000\000\000@=\000\000\000\000\000\000@=\000\000\000\000\000\000
\002\000\000\000\000\000\000
\002\000\000\000\000\000\000\b\000\000\000\000\000\000\000\004\000\000\000\004\000\000\000\070\003\000\000\000\000\000\000\070\003\000\000\000\000\000\000\070\003\000\000\000\000\000\000
\000\000\000\000\000\000\000
\000\000\000\000\000\000\000\b\000\000\000\000\000\000\000\004\000\000\000\004\000\000\000X\003\000\000\000\000\000\000X\003\000\000\000\000\000\000X\003\000\000\000\000\000\000D\000\000\000\000\000\000\000D\000\000\000\000\000\000\000\004\000\000\000\000\000\000\000S\345td\004\000\000\000pz\350\320\023u\023\376\070\003\000\000\000\000\000\000\070\003\000\000\000\000\000\000
\000\000\000\000\000\000\000
\000\000\000\000\000\000\000\b\000\000\000\000\000\000\000"...,
  read_pos = 0,
  truncated = 0,
  error_retries = 6,
  ibf_timeout = 5000000,
  obf_timeout = 5000000,
  error0_timeout = 4439151592
}
crash>

>From the above it looks like, at the time of the crash, the state machine
was at KCS_ERROR1 (si_sm_data.state) having at that moment in time handled
6 retries (si_sm_data.error_retries), but having a hosed counter of 24,503
(smi_info.stats[6]).

Looking in the smi_event_handler code, I wasn't immediately sure whether a
result of SI_SM_HOSED would cause the interrupt handler to keep looping
around and not allow other interrupts to fire, but the symptoms might
suggest that?  Although if that was the case I'm surprised we haven't seen
the problem more often, we have lots of machines.

My presumption was that this:

[144482.968724] RIP: 0010:port_outb+0x13/0x20 [ipmi_si]
[144482.968735] Code: 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 0f 1f 44 00 00 66 0f af 77 18 89 d0 0f b7 57 28 01 f2 ee
<c3> cc cc cc cc 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90

... as well as the thousands of state transitions I saw when reproducing
the problem with debug output, and the hosed counter being very high in the
vmcore, suggested that it wasn't actually stuck at a ret instruction (c3)
in port_outb, but that's just where RIP was at the point the crash was
taken.

Happy to collect more info from the vmcore as needed or test patches etc.

Best regards
Mark
_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to