On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via Openipmi-developer wrote:
> Hi,
>
> I’ve been trying to debug the PANIC and OEM string handling and am running
> out of ideas whether this is a bug or whether something so subtle has changed
> in my config that I’m just not seeing it.
>
> (Note: I’m willing to pay for consulting.)

Probably not necessary.

>
> I have machines that we’ve moved from an older setup (Gentoo, (mostly)
> vanilla kernel 4.19.157) to a newer setup (NixOS, (mostly) vanilla kernel
> 5.10.159) and I’m now experiencing crashes that seem to be kernel panics but
> do not get the usual messages in the IPMI SEL.

I just tested on stock 5.10.159 and it worked without issue. Everything you
have below looks ok. Can you test by causing a crash with:

  echo c >/proc/sysrq-trigger

and see if it works?

It sounds like you are having some type of crash that you would normally use
the IPMI logs to debug. However, they aren't perfect: the system has to stay
up long enough to get them into the event log. In this situation, getting a
serial console (probably through IPMI Serial over LAN) and capturing the
console output on a crash is probably your best option. You can use ipmitool
for this, or I have a library that is able to make connections to serial
ports, including through IPMI SoL.

-corey

>
> The kernel does include the necessary drivers, the watchdog is active and the
> SEL shows the watchdog action. I have reason to think that it’s a panic
> because the typical behaviour of the timeout jumping to 255 happens.
>
> Here’s the IPMI-related config and cmdline from the old kernel where it works:
>
> BOOT_IMAGE=/kernel-genkernel-x86_64-4.19.157 root=/dev/vgsys/root ro
> rootfstype=ext4 dolvm ipmi_watchdog.timeout=60 igb.InterruptThrottleRate=1
> ixgbe.InterruptThrottleRate=1 console=ttyS2,57600
>
> # CONFIG_ACPI_IPMI is not set
> CONFIG_IPMI_HANDLER=y
> CONFIG_IPMI_DMI_DECODE=y
> CONFIG_IPMI_PANIC_EVENT=y
> CONFIG_IPMI_PANIC_STRING=y
> CONFIG_IPMI_DEVICE_INTERFACE=y
> CONFIG_IPMI_SI=y
> # CONFIG_IPMI_SSIF is not set
> CONFIG_IPMI_WATCHDOG=y
> CONFIG_IPMI_POWEROFF=y
>
> On that system (as everything is statically compiled) the lsmod is empty WRT
> ipmi and the kernel log shows:
>
> [ 4.374757] ipmi device interface
> [ 4.389388] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
> [ 4.402087] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
> [ 4.413907] ipmi_si: Adding SMBIOS-specified kcs state machine
> [ 4.425570] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
> [ 4.437408] ipmi_si IPI0001:00: [io 0x0ca2] regsize 1 spacing 1 irq 0
> [ 4.450449] ipmi_si dmi-ipmi-si.0: Removing SMBIOS-specified kcs state
> machine in favor of ACPI
> [ 4.467818] ipmi_si: Adding ACPI-specified kcs state machine
> [ 4.479139] ipmi_si: Trying ACPI-specified kcs state machine at i/o
> address 0xca2, slave address 0x0, irq 0
> [ 4.567613] ipmi_si IPI0001:00: The BMC does not support clearing the recv
> irq bit, compensating, but the BMC needs to be fixed.
> [ 4.617693] ipmi_si IPI0001:00: Found new BMC (man_id: 0x002a7c, prod_id:
> 0x0624, dev_id: 0x20)
> [ 4.671871] ipmi_si IPI0001:00: IPMI kcs interface initialized
>
> And here’s the controller info:
>
> Device ID : 32
> Device Revision : 1
> Firmware Revision : 2.24
> IPMI Version : 2.0
> Manufacturer ID : 10876
> Manufacturer Name : Supermicro
> Product ID : 1572 (0x0624)
> Product Name : Unknown (0x624)
> Device Available : yes
> Provides Device SDRs : no
> Additional Device Support :
>     Sensor Device
>     SDR Repository Device
>     SEL Device
>     FRU Inventory Device
>     IPMB Event Receiver
>     IPMB Event Generator
>     Chassis Device
> Aux Firmware Rev Info :
>     0x06
>     0x00
>     0x00
>     0x00
>
> And here’s the NixOS environment where it doesn’t work:
>
> BOOT_IMAGE=/kernels/qy42jhicvvqb0p7x2h0i46b2x0f1w74q-linux-5.10.159-bzImage
> init=/nix/store/qx33nyr0f60y76yzmbgsikxr60lqzdb3-nixos-system-...-21.05pre-git/init
> dolvm ipmi_watchdog.timeout=60 igb.InterruptThrottleRate=1
> ixgbe.InterruptThrottleRate=1 panic=1 boot.panic_on_fail
> systemd.journald.forward_to_console=no systemd.log_target=kmsg
> console=ttyS1,115200 loglevel=7
>
> CONFIG_ACPI_IPMI=m
> CONFIG_IPMI_HANDLER=m
> CONFIG_IPMI_DMI_DECODE=y
> CONFIG_IPMI_PLAT_DATA=y
> CONFIG_IPMI_PANIC_EVENT=y
> CONFIG_IPMI_PANIC_STRING=y
> CONFIG_IPMI_DEVICE_INTERFACE=m
> CONFIG_IPMI_SI=m
> CONFIG_IPMI_SSIF=m
> CONFIG_IPMI_WATCHDOG=m
> CONFIG_IPMI_POWEROFF=m
>
> On the newer system this is what appears in the kernel log:
>
> [ 22.070935] ipmi device interface
> [ 22.086353] systemd-modules-load[572]: Inserted module 'ipmi_watchdog'
> [ 22.904717] ipmi_si: IPMI System Interface driver
> [ 22.911022] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
> [ 22.917393] ipmi_platform: ipmi_si: SMBIOS: io 0xca8 regsize 1 spacing 4
> irq 0
> [ 22.925092] ipmi_si: Adding SMBIOS-specified kcs state machine
> [ 22.931023] ipmi_si: Trying SMBIOS-specified kcs state machine at i/o
> address 0xca8, slave address 0x20, irq 0
> [ 23.119892] ipmi_si dmi-ipmi-si.0: IPMI message handler: Found new BMC
> (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)
> [ 23.438469] ipmi_si dmi-ipmi-si.0: IPMI kcs interface initialized
> [ 23.441630] ipmi_ssif: IPMI SSIF Interface driver
>
> And the ipmi-related modules look like this:
>
> ipmi_ssif 40960 0
> ipmi_si 73728 1
> ipmi_watchdog 32768 1
> ipmi_devintf 20480 0
> ipmi_msghandler 73728 4 ipmi_devintf,ipmi_si,ipmi_watchdog,ipmi_ssif
> i2c_core 102400 5
> drm_kms_helper,i2c_algo_bit,mgag200,ipmi_ssif,drm
>
> In this case it’s a DELL IPMI controller:
>
> Device ID : 32
> Device Revision : 0
> Firmware Revision : 1.52
> IPMI Version : 2.0
> Manufacturer ID : 674
> Manufacturer Name : DELL Inc
> Product ID : 256 (0x0100)
> Product Name : Unknown (0x100)
> Device Available : yes
> Provides Device SDRs : yes
> Additional Device Support :
>     Sensor Device
>     SDR Repository Device
>     SEL Device
>     FRU Inventory Device
>     IPMB Event Receiver
>     Bridge
>     Chassis Device
> Aux Firmware Rev Info :
>     0x00
>     0x0a
>     0x00
>     0x00
>
> But the behaviour has been the same on various SuperMicro machines.
>
> So, after running out of ideas what to look for, I’m left with those
> questions:
>
> 1. when I trigger a panic manually via “echo c > /proc/sysrq-trigger” - that
> should also create a panic message that appears in the SEL, right?
>
> 2. Is there anything that comes to mind that I could have configured
> incorrectly in the kernel?
>
> 3. Or is there anything I can inspect after boot to know which setting the
> “panic_op” has?
>
> I have running systems with both the old and new setups available that I can
> freely poke to analyze it interactively.
>
> Any help is appreciated as I’ve run out of ideas and (just to make sure) I’m
> happy to pay proper consulting rates (especially happy to support open source
> work).
>
> Kind regards,
> Christian Theune
>
> --
> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>
>
>
> _______________________________________________
> Openipmi-developer mailing list
> Openipmi-developer@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
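
For question 2 in the quoted message, it may help to compare the two quoted IPMI option lists side by side. The following is a minimal Python sketch and not part of the original exchange: the helper names parse_kconfig and diff_kconfig are made up for illustration, and the option values are copied from the two configs quoted above. It only lists the differences; it does not say which of them, if any, matters for the missing SEL entries.

```python
def parse_kconfig(text):
    """Map each CONFIG_* option to its value; '# ... is not set' becomes 'n'."""
    options = {}
    suffix = " is not set"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" -> {"CONFIG_FOO": "n"}
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(old, new):
    """Return {option: (old_value, new_value)} for every option that differs.

    An option that is absent from one list is reported as None on that side.
    """
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Values copied from the quoted 4.19.157 (Gentoo) config.
OLD_CONFIG = """
# CONFIG_ACPI_IPMI is not set
CONFIG_IPMI_HANDLER=y
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_PANIC_STRING=y
CONFIG_IPMI_DEVICE_INTERFACE=y
CONFIG_IPMI_SI=y
# CONFIG_IPMI_SSIF is not set
CONFIG_IPMI_WATCHDOG=y
CONFIG_IPMI_POWEROFF=y
"""

# Values copied from the quoted 5.10.159 (NixOS) config.
NEW_CONFIG = """
CONFIG_ACPI_IPMI=m
CONFIG_IPMI_HANDLER=m
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_PANIC_STRING=y
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_SSIF=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
"""

if __name__ == "__main__":
    changes = diff_kconfig(parse_kconfig(OLD_CONFIG), parse_kconfig(NEW_CONFIG))
    for name, (old, new) in changes.items():
        print(f"{name}: {old} -> {new}")
```

With the two lists as quoted, CONFIG_IPMI_PANIC_EVENT and CONFIG_IPMI_PANIC_STRING come out identical. The differences are the options that moved from built-in to modules (or from unset to modules), plus CONFIG_IPMI_PLAT_DATA, which appears only in the newer list.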