Hi, alright, so here’s the output from the NixOS machine:

root@xxx ~ # echo c >/proc/sysrq-trigger
client_loop: send disconnect: Broken pipe
…
root@xxx ~ # journalctl -u ipmi-log.service
-- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL. Please allow a few seconds to erase.
...
-- Boot fdef496e784e4541abd9ae40df472a0b --
Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. Please allow a few seconds to erase.

The SOL log looks like this:

[1107585.917689] sysrq: Trigger a crash
[1107585.921272] Kernel panic - not syncing: sysrq triggered crash
[1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G I 5.10.159 #1-NixOS
[1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
[1107585.943058] Call Trace:
[1107585.945680]  dump_stack+0x6b/0x83
[1107585.949158]  panic+0x101/0x2c8
[1107585.952379]  ? printk+0x58/0x73
[1107585.955687]  sysrq_handle_crash+0x16/0x20
[1107585.959859]  __handle_sysrq.cold+0x43/0x11a
[1107585.964203]  write_sysrq_trigger+0x24/0x40
[1107585.968463]  proc_reg_write+0x51/0x90
[1107585.972290]  vfs_write+0xc3/0x280
[1107585.975768]  ksys_write+0x5f/0xe0
[1107585.979248]  do_syscall_64+0x33/0x40
[1107585.982987]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[1107585.988199] RIP: 0033:0x7f5873932133
[1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
[1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
[1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
[1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
[1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
[1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002

Nothing obvious to me here … if you have any further ideas what to test, let me know. I should be more responsive again now.

Thanks and kind regards,
Christian

> On 5. Mar 2023, at 23:53, Corey Minyard <[email protected]> wrote:
>
> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
>> I’m going to actually attach a serial console to watch the “echo c” panic,
>> maybe that gives _some_ indication.
>>
>> Otherwise: I can quickly run patches on the kernel there to try out things.
>> (And the funding offer still stands.)
>
> Any news on this? I'm curious what this could be.
>
> -corey
>
>>
>> Christian
>>
>>> On 1. Mar 2023, at 17:58, Corey Minyard <[email protected]> wrote:
>>>
>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
>>>> Thanks, both machines report:
>>>>
>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
>>>> string
>>>
>>> At this point, I have no idea. I'd have to start adding printks into
>>> the code and cause crashes to see what is happening.
>>>
>>> Maybe something is getting in the way of the panic notifiers and doing
>>> something to prevent the IPMI driver from working.
>>>
>>> -corey
>>>
>>>>
>>>>
>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <[email protected]> wrote:
>>>>>
>>>>> Oh, I forgot. You can look at panic_op in
>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>
>>>>> -corey
>>>>>
>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via
>>>>> Openipmi-developer wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <[email protected]> wrote:
>>>>>>>
>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via
>>>>>>> Openipmi-developer wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and am
>>>>>>>> running out of ideas whether this is a bug or whether something so
>>>>>>>> subtle has changed in my config that I’m just not seeing it.
>>>>>>>>
>>>>>>>> (Note: I’m willing to pay for consulting.)
>>>>>>>
>>>>>>> Probably not necessary.
>>>>>>
>>>>>> Thanks! The offer always stands. If we should ever meet I’m also able to
>>>>>> pay in beverages. ;)
>>>>>>
>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo, (mostly)
>>>>>>>> vanilla kernel 4.19.157) to a newer setup (NixOS, (mostly) vanilla
>>>>>>>> kernel 5.10.159) and I’m now experiencing crashes that seem to be
>>>>>>>> kernel panics but do not get the usual messages in the IPMI SEL.
>>>>>>>
>>>>>>> I just tested on stock 5.10.159 and it worked without issue. Everything
>>>>>>> you have below looks ok.
>>>>>>>
>>>>>>> Can you test by causing a crash with:
>>>>>>>
>>>>>>> echo c >/proc/sysrq-trigger
>>>>>>>
>>>>>>> and see if it works?
>>>>>>
>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
>>>>>>
>>>>>>> It sounds like you are having some type of crash that you would normally
>>>>>>> use the IPMI logs to debug. However, they aren't perfect, the system
>>>>>>> has to stay up long enough to get them into the event log.
>>>>>>
>>>>>> I think they are staying up long enough because a panic triggers the 255
>>>>>> second bump in the watchdog and only then pass on. However, I’ve also
>>>>>> noticed that the kernel _should_ be rebooting after a panic much faster
>>>>>> (and not rely on the watchdog) and that doesn’t happen either. (Sorry
>>>>>> this just popped from the back of my head).
>>>>>>
>>>>>>> In this situation, getting a serial console (probably through IPMI
>>>>>>> Serial over LAN) and getting the console output on a crash is probably
>>>>>>> your best option. You can use ipmitool for this, or I have a library
>>>>>>> that is able to make connections to serial ports, including through IPMI
>>>>>>> SoL.
>>>>>>
>>>>>> Yup. Been there, too. :)
>>>>>>
>>>>>> Unfortunately we’re currently chasing something that pops up very
>>>>>> randomly on somewhat odd machines and I also have the feeling that it’s
>>>>>> systematically broken right now (as the “echo c” doesn’t work).
>>>>>>
>>>>>> Thanks a lot,
>>>>>> Christian
>>>>>>
>>>>>> --
>>>>>> Christian Theune · [email protected] · +49 345 219401 0
>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian
>>>>>> Zagrodnick
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Openipmi-developer mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
>>>>
>>>> Kind regards,
>>>> Christian Theune
>>>>
>>
>> Kind regards,
>> Christian Theune
>>

Kind regards,
Christian Theune
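
For anyone repeating the "echo c" test, the check on the journal output above could be automated roughly as follows. This is only a minimal sketch, not part of the setup described in this thread: it assumes the SEL records appear in the journal in the six-field, pipe-separated form shown above, and the keywords used to recognise a panic record ("panic", "critical stop") are a guess at what a successfully logged panic might contain. The helper names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Sequence


class SelEntry(NamedTuple):
    record_id: str
    date: str
    time: str
    sensor: str
    event: str
    state: str


def parse_sel_line(line: str) -> Optional[SelEntry]:
    """Parse one journal line; return None if it carries no SEL record."""
    # Drop the journald prefix ("Mar 13 ... ipmi-log-start[1973]: ") if present.
    message = line.split("]: ", 1)[-1]
    fields = [field.strip() for field in message.split("|")]
    # Assumption: a SEL record has exactly six pipe-separated fields.
    if len(fields) != 6:
        return None
    return SelEntry(*fields)


def has_panic_record(
    entries: Sequence[SelEntry],
    keywords: Sequence[str] = ("panic", "critical stop"),  # assumed markers
) -> bool:
    """Heuristic: does any record mention one of the keywords?"""
    for entry in entries:
        text = f"{entry.sensor} {entry.event}".lower()
        if any(keyword in text for keyword in keywords):
            return True
    return False


# Lines copied from the journalctl output in this mail.
JOURNAL_LINES: List[str] = [
    "Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | "
    "Event Logging Disabled SEL | Log area reset/cleared | Asserted",
    "Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | "
    "Watchdog2 OS Watchdog | Hard reset | Asserted",
    "Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. "
    "Please allow a few seconds to erase.",
]

entries = [e for e in map(parse_sel_line, JOURNAL_LINES) if e is not None]
for entry in entries:
    print(f"{entry.record_id}: {entry.sensor} / {entry.event}")
print("panic record present:", has_panic_record(entries))
```

On the lines from this mail it finds only the "Log area reset/cleared" and watchdog "Hard reset" records, which matches the observation that the panic itself never reaches the SEL.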
