Hi,

I’ve been trying to debug the IPMI panic event and OEM string handling and am
running out of ideas: I can’t tell whether this is a bug or whether something
subtle has changed in my config that I’m just not seeing.

(Note: I’m willing to pay for consulting.)

I have machines that we’ve moved from an older setup (Gentoo, mostly vanilla
kernel 4.19.157) to a newer setup (NixOS, mostly vanilla kernel 5.10.159), and
I’m now seeing crashes that appear to be kernel panics but do not leave the
usual messages in the IPMI SEL.

The kernel does include the necessary drivers, the watchdog is active, and the
SEL shows the watchdog action. I have reason to think these are panics because
the watchdog timeout jumps to 255, which is what typically happens on a panic.

Here’s the IPMI-related config and cmdline from the old kernel where it works:

BOOT_IMAGE=/kernel-genkernel-x86_64-4.19.157 root=/dev/vgsys/root ro rootfstype=ext4 dolvm ipmi_watchdog.timeout=60 igb.InterruptThrottleRate=1 ixgbe.InterruptThrottleRate=1 console=ttyS2,57600

# CONFIG_ACPI_IPMI is not set
CONFIG_IPMI_HANDLER=y
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_PANIC_STRING=y
CONFIG_IPMI_DEVICE_INTERFACE=y
CONFIG_IPMI_SI=y
# CONFIG_IPMI_SSIF is not set
CONFIG_IPMI_WATCHDOG=y
CONFIG_IPMI_POWEROFF=y

On that system everything is statically compiled, so lsmod shows no IPMI
modules, and the kernel log shows:

[    4.374757] ipmi device interface
[    4.389388] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
[    4.402087] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
[    4.413907] ipmi_si: Adding SMBIOS-specified kcs state machine
[    4.425570] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
[    4.437408] ipmi_si IPI0001:00: [io  0x0ca2] regsize 1 spacing 1 irq 0
[    4.450449] ipmi_si dmi-ipmi-si.0: Removing SMBIOS-specified kcs state machine in favor of ACPI
[    4.467818] ipmi_si: Adding ACPI-specified kcs state machine
[    4.479139] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
[    4.567613] ipmi_si IPI0001:00: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
[    4.617693] ipmi_si IPI0001:00: Found new BMC (man_id: 0x002a7c, prod_id: 0x0624, dev_id: 0x20)
[    4.671871] ipmi_si IPI0001:00: IPMI kcs interface initialized

And here’s the controller info:

Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 2.24
IPMI Version              : 2.0
Manufacturer ID           : 10876
Manufacturer Name         : Supermicro
Product ID                : 1572 (0x0624)
Product Name              : Unknown (0x624)
Device Available          : yes
Provides Device SDRs      : no
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device
Aux Firmware Rev Info     :
    0x06
    0x00
    0x00
    0x00

And here’s the NixOS environment where it doesn’t work:

BOOT_IMAGE=/kernels/qy42jhicvvqb0p7x2h0i46b2x0f1w74q-linux-5.10.159-bzImage init=/nix/store/qx33nyr0f60y76yzmbgsikxr60lqzdb3-nixos-system-...-21.05pre-git/init dolvm ipmi_watchdog.timeout=60 igb.InterruptThrottleRate=1 ixgbe.InterruptThrottleRate=1 panic=1 boot.panic_on_fail systemd.journald.forward_to_console=no systemd.log_target=kmsg console=ttyS1,115200 loglevel=7

CONFIG_ACPI_IPMI=m
CONFIG_IPMI_HANDLER=m
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_PANIC_STRING=y
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_SSIF=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
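
To make the differences between the two config fragments easier to see, here is a minimal sketch that diffs them. The option values are copied from the two listings above; the helper names `parse_config` and `diff_configs` are hypothetical and only serve this illustration.

```python
# Sketch: compare the IPMI-related kernel options of the old and new kernels.
# The two fragments are copied verbatim from the listings in this mail.

OLD_4_19 = """\
# CONFIG_ACPI_IPMI is not set
CONFIG_IPMI_HANDLER=y
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_PANIC_STRING=y
CONFIG_IPMI_DEVICE_INTERFACE=y
CONFIG_IPMI_SI=y
# CONFIG_IPMI_SSIF is not set
CONFIG_IPMI_WATCHDOG=y
CONFIG_IPMI_POWEROFF=y
"""

NEW_5_10 = """\
CONFIG_ACPI_IPMI=m
CONFIG_IPMI_HANDLER=m
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_PANIC_STRING=y
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_SSIF=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
"""

NOT_SET_SUFFIX = " is not set"


def parse_config(text):
    """Map option name to 'y', 'm', or 'n' (for '# ... is not set' lines)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET_SUFFIX):
            options[line[2:-len(NOT_SET_SUFFIX)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(old, new):
    """Return {option: (old_value, new_value)} for every option that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name, "absent")
        after = new.get(name, "absent")
        if before != after:
            changed[name] = (before, after)
    return changed


if __name__ == "__main__":
    changes = diff_configs(parse_config(OLD_4_19), parse_config(NEW_5_10))
    for name, (before, after) in changes.items():
        print(f"{name}: {before} -> {after}")
```

CONFIG_IPMI_PANIC_EVENT and CONFIG_IPMI_PANIC_STRING are identical in both kernels; what changed is that most of the IPMI stack moved from built-in to modules, and ACPI_IPMI and SSIF are now enabled.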

On the newer system this is what appears in the kernel log:

[   22.070935] ipmi device interface
[   22.086353] systemd-modules-load[572]: Inserted module 'ipmi_watchdog'
[   22.904717] ipmi_si: IPMI System Interface driver
[   22.911022] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
[   22.917393] ipmi_platform: ipmi_si: SMBIOS: io 0xca8 regsize 1 spacing 4 irq 0
[   22.925092] ipmi_si: Adding SMBIOS-specified kcs state machine
[   22.931023] ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 0xca8, slave address 0x20, irq 0
[   23.119892] ipmi_si dmi-ipmi-si.0: IPMI message handler: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)
[   23.438469] ipmi_si dmi-ipmi-si.0: IPMI kcs interface initialized
[   23.441630] ipmi_ssif: IPMI SSIF Interface driver

And the ipmi-related modules look like this:

ipmi_ssif              40960  0
ipmi_si                73728  1
ipmi_watchdog          32768  1
ipmi_devintf           20480  0
ipmi_msghandler        73728  4 ipmi_devintf,ipmi_si,ipmi_watchdog,ipmi_ssif
i2c_core              102400  5 drm_kms_helper,i2c_algo_bit,mgag200,ipmi_ssif,drm

In this case it’s a Dell IPMI controller:

Device ID                 : 32
Device Revision           : 0
Firmware Revision         : 1.52
IPMI Version              : 2.0
Manufacturer ID           : 674
Manufacturer Name         : DELL Inc
Product ID                : 256 (0x0100)
Product Name              : Unknown (0x100)
Device Available          : yes
Provides Device SDRs      : yes
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    Bridge
    Chassis Device
Aux Firmware Rev Info     :
    0x00
    0x0a
    0x00
    0x00

But the behaviour has been the same on various Supermicro machines.
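
Note that the two kernel logs quoted in this mail come from different hardware. The hex IDs in the "Found new BMC" lines can be cross-checked against the decimal IDs in the controller info listings; a small sketch of that conversion follows (the helper name `parse_bmc_ids` is hypothetical, the log lines are copied from above).

```python
import re

# Matches the three ID fields in a "Found new BMC (...)" kernel log line.
BMC_ID_PATTERN = re.compile(r"(man_id|prod_id|dev_id): (0x[0-9a-fA-F]+)")


def parse_bmc_ids(log_line):
    """Return the man_id/prod_id/dev_id fields of a log line as integers."""
    return {key: int(value, 16) for key, value in BMC_ID_PATTERN.findall(log_line)}


OLD_LOG_LINE = ("[    4.617693] ipmi_si IPI0001:00: Found new BMC "
                "(man_id: 0x002a7c, prod_id: 0x0624, dev_id: 0x20)")
NEW_LOG_LINE = ("[   23.119892] ipmi_si dmi-ipmi-si.0: IPMI message handler: Found new BMC "
                "(man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)")

if __name__ == "__main__":
    print("old:", parse_bmc_ids(OLD_LOG_LINE))
    print("new:", parse_bmc_ids(NEW_LOG_LINE))
```

The old log line decodes to manufacturer 10876 and product 1572, matching the Supermicro listing; the new one decodes to manufacturer 674 and product 256, matching the Dell listing.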

So, having run out of ideas about what to look for, I’m left with these questions:

1. When I trigger a panic manually via “echo c > /proc/sysrq-trigger”, that
should also create a panic message that appears in the SEL, right?

2. Is there anything that comes to mind that I could have configured 
incorrectly in the kernel?

3. Or is there anything I can inspect after boot to find out what “panic_op”
is set to?
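
Regarding question 3, one candidate place to look is sysfs. The sketch below assumes that ipmi_msghandler exposes panic_op as a module parameter under /sys/module/ipmi_msghandler/parameters/ and that it is readable only by root; both points are assumptions that would need to be confirmed on the running 4.19 and 5.10 systems. The function name `read_module_param` is hypothetical.

```python
from pathlib import Path

# Assumed location of the ipmi_msghandler parameters; not confirmed for
# the kernels discussed in this mail.
DEFAULT_PARAM_DIR = Path("/sys/module/ipmi_msghandler/parameters")


def read_module_param(name, param_dir=DEFAULT_PARAM_DIR):
    """Return the parameter's current value as a string.

    Returns None if the file does not exist or cannot be read
    (for example when not running as root).
    """
    try:
        return (Path(param_dir) / name).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param("panic_op")
    if value is None:
        print("panic_op: not exposed, or not readable without root")
    else:
        print("panic_op:", value)
```

Running this on both the old and the new system would show whether the effective setting differs between them, if the parameter is exposed there at all.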

I have running systems with both the old and new setups available that I can 
freely poke to analyze it interactively.

Any help is appreciated as I’ve run out of ideas and (just to make sure) I’m 
happy to pay proper consulting rates (especially happy to support open source 
work).

Kind regards,
Christian Theune

-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
