On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
> Hi,
>
> alright, so here’s the output from the NixOS machine:
>
> root@xxx ~ # echo c >/proc/sysrq-trigger
> client_loop: send disconnect: Broken pipe
> …
>
> root@xxx ~ # journalctl -u ipmi-log.service
> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL. Please allow a few seconds to erase.
> ...
> -- Boot fdef496e784e4541abd9ae40df472a0b --
> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. Please allow a few seconds to erase.

Hmm, the SEL got cleared. That would clear out any of the logs that were
issued before that time. I'm not sure when the above happened versus the
crash, though. It looks like it occurred as part of the reboot, but I'm
not sure what I'm seeing. Maybe you have a startup process that clears
the SEL?

Assuming that's not the issue, what you have looks ok. I'd need to add
some logs to the kernel to see if the log operation ever happens.

-corey

>
> The SOL log looks like this:
>
> [1107585.917689] sysrq: Trigger a crash
> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G I 5.10.159 #1-NixOS
> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
> [1107585.943058] Call Trace:
> [1107585.945680] dump_stack+0x6b/0x83
> [1107585.949158] panic+0x101/0x2c8
> [1107585.952379] ? printk+0x58/0x73
> [1107585.955687] sysrq_handle_crash+0x16/0x20
> [1107585.959859] __handle_sysrq.cold+0x43/0x11a
> [1107585.964203] write_sysrq_trigger+0x24/0x40
> [1107585.968463] proc_reg_write+0x51/0x90
> [1107585.972290] vfs_write+0xc3/0x280
> [1107585.975768] ksys_write+0x5f/0xe0
> [1107585.979248] do_syscall_64+0x33/0x40
> [1107585.982987] entry_SYSCALL_64_after_hwframe+0x61/0xc6
> [1107585.988199] RIP: 0033:0x7f5873932133
> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002
>
> Nothing obvious to me here … if you have any further ideas what to test, let
> me know. I should be more responsive again now.
>
> Thanks and kind regards,
> Christian
>
> > On 5. Mar 2023, at 23:53, Corey Minyard <miny...@acm.org> wrote:
> >
> > On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
> >> I’m going to actually attach a serial console to watch the “echo c” panic,
> >> maybe that gives _some_ indication.
> >>
> >> Otherwise: I can quickly run patches on the kernel there to try out
> >> things. (And the funding offer still stands.)
> >
> > Any news on this? I'm curious what this could be.
> >
> > -corey
> >
> >>
> >> Christian
> >>
> >>> On 1. Mar 2023, at 17:58, Corey Minyard <miny...@acm.org> wrote:
> >>>
> >>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
> >>>> Thanks, both machines report:
> >>>>
> >>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
> >>>> string
> >>>
> >>> At this point, I have no idea. I'd have to start adding printks into
> >>> the code and cause crashes to see what is happening.
> >>>
> >>> Maybe something is getting in the way of the panic notifiers and doing
> >>> something to prevent the IPMI driver from working.
> >>>
> >>> -corey
> >>>
> >>>>
> >>>>
> >>>>> On 28. Feb 2023, at 18:04, Corey Minyard <miny...@acm.org> wrote:
> >>>>>
> >>>>> Oh, I forgot. You can look at panic_op in
> >>>>> /sys/module/ipmi_msghandler/parameters/panic_op
> >>>>>
> >>>>> -corey
> >>>>>
> >>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via
> >>>>> Openipmi-developer wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>>
> >>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via
> >>>>>>> Openipmi-developer wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and am
> >>>>>>>> running out of ideas whether this is a bug or whether something so
> >>>>>>>> subtle has changed in my config that I’m just not seeing it.
> >>>>>>>>
> >>>>>>>> (Note: I’m willing to pay for consulting.)
> >>>>>>>
> >>>>>>> Probably not necessary.
> >>>>>>
> >>>>>> Thanks! The offer always stands. If we should ever meet I’m also able
> >>>>>> to pay in beverages. ;)
> >>>>>>
> >>>>>>>> I have machines that we’ve moved from an older setup (Gentoo,
> >>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS, (mostly)
> >>>>>>>> vanilla kernel 5.10.159) and I’m now experiencing crashes that seem
> >>>>>>>> to be kernel panics but do not get the usual messages in the IPMI
> >>>>>>>> SEL.
> >>>>>>>
> >>>>>>> I just tested on stock 5.10.159 and it worked without issue. Everything
> >>>>>>> you have below looks ok.
> >>>>>>>
> >>>>>>> Can you test by causing a crash with:
> >>>>>>>
> >>>>>>> echo c >/proc/sysrq-trigger
> >>>>>>>
> >>>>>>> and see if it works?
> >>>>>>
> >>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
> >>>>>>
> >>>>>>> It sounds like you are having some type of crash that you would normally
> >>>>>>> use the IPMI logs to debug. However, they aren't perfect, the system
> >>>>>>> has to stay up long enough to get them into the event log.
> >>>>>>
> >>>>>> I think they are staying up long enough because a panic triggers the
> >>>>>> 255 second bump in the watchdog and only then pass on. However, I’ve
> >>>>>> also noticed that the kernel _should_ be rebooting after a panic much
> >>>>>> faster (and not rely on the watchdog) and that doesn’t happen either.
> >>>>>> (Sorry this just popped from the back of my head).
> >>>>>>
> >>>>>>> In this situation, getting a serial console (probably through IPMI
> >>>>>>> Serial over LAN) and getting the console output on a crash is probably
> >>>>>>> your best option. You can use ipmitool for this, or I have a library
> >>>>>>> that is able to make connections to serial ports, including through
> >>>>>>> IPMI SoL.
> >>>>>>
> >>>>>> Yup. Been there, too. :)
> >>>>>>
> >>>>>> Unfortunately we’re currently chasing something that pops up very
> >>>>>> randomly on somewhat odd machines and I also have the feeling that
> >>>>>> it’s systematically broken right now (as the “echo c” doesn’t work).
> >>>>>>
> >>>>>> Thanks a lot,
> >>>>>> Christian
> >>>>>>
> >>>>>> --
> >>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> >>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Germany
> >>>>>> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Openipmi-developer mailing list
> >>>>>> Openipmi-developer@lists.sourceforge.net
> >>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
> >>>>
> >>>> Kind regards,
> >>>> Christian Theune
> >>>>
> >>
> >> Kind regards,
> >> Christian Theune
> >>
>
> Kind regards,
> Christian Theune
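
For anyone repeating the "echo c" test discussed in this thread, here is a minimal Python sketch for checking what the SEL actually recorded before the next clear. It parses pipe-separated SEL lines as they appear in the ipmi-log.service journal quoted above and counts "log cleared" entries, watchdog entries, and everything else. The names parse_sel_line and summarize_sel are hypothetical, and the six-field layout is an assumption taken from the two journal lines in this thread; other BMCs or tools may format entries differently.

```python
from typing import NamedTuple, Optional


class SelEntry(NamedTuple):
    record_id: str
    date: str
    time: str
    sensor: str
    event: str
    direction: str


def parse_sel_line(line: str) -> Optional[SelEntry]:
    """Parse one pipe-separated SEL line; return None if it does not match."""
    # Drop a journal prefix such as "... ipmi-log-start[1973]: " if present.
    if "]: " in line:
        line = line.split("]: ", 1)[1]
    fields = [field.strip() for field in line.split("|")]
    # Assumed layout: id | date | time | sensor | event | direction
    if len(fields) != 6 or not fields[0]:
        return None
    return SelEntry(*fields)


def summarize_sel(lines):
    """Count cleared-log, watchdog, and all other SEL entries."""
    counts = {"cleared": 0, "watchdog": 0, "other": 0}
    for line in lines:
        entry = parse_sel_line(line)
        if entry is None:
            continue  # e.g. "Clearing SEL. Please allow a few seconds to erase."
        if "Log area reset/cleared" in entry.event:
            counts["cleared"] += 1
        elif entry.sensor.startswith("Watchdog2"):
            counts["watchdog"] += 1
        else:
            counts["other"] += 1
    return counts


# Journal lines copied from the message quoted in this thread.
SAMPLE = [
    "Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | "
    "Event Logging Disabled SEL | Log area reset/cleared | Asserted",
    "Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | "
    "Watchdog2 OS Watchdog | Hard reset | Asserted",
    "Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. Please allow a "
    "few seconds to erase.",
]

if __name__ == "__main__":
    print(summarize_sel(SAMPLE))
```

On the lines from this thread, the "other" count stays at zero: between the clear and the watchdog hard reset no further entry was recorded, which matches the observation that the panic never reached the SEL. It does not say why; that would still need the kernel-side logging mentioned above.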