On Mon, Mar 13, 2023 at 02:07:01PM +0100, Christian Theune wrote: > Hi, > > yeah, the IPMI log is fine. This is a 10 minute interval job in our system > that exports the log and clears it: > > The job looks like this: > > /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel > elist > /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel > clear > > So it’s not atomic but it runs after the boot and the elist should output it > properly … at least it did in the past. ;) > > As I said - I’m happy to run any patches you have. If you point me to a git > branch somewhere I can switch that system easily.
Ok, I have a branch at https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events that has debug tracing. It will print the function for all panic event handlers, their return values, and adds traces in the IPMI panic event handlers. It's a single patch right on top of 6.2; I'm not sure how portable it is to other kernel versions. I can port if you like. Thanks, -corey > > Cheers, > Christian > > > On 13. Mar 2023, at 13:58, Corey Minyard <[email protected]> wrote: > > > > On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote: > >> Hi, > >> > >> alright, so here’s the output from the NixOS machine: > >> > >> root@xxx ~ # echo c >/proc/sysrq-trigger > >> client_loop: send disconnect: Broken pipe > >> … > >> > >> root@xxx ~ # journalctl -u ipmi-log.service > >> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 > >> 10:25:27 CET. -- > >> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL. Please allow a > >> few seconds to erase. > >> ... > >> -- Boot fdef496e784e4541abd9ae40df472a0b -- > >> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | > >> Event Logging Disabled SEL | Log area reset/cleared | Asserted > >> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | > >> Watchdog2 OS Watchdog | Hard reset | Asserted > >> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. Please allow a > >> few seconds to erase. > > > > Hmm, the SEL got cleared. That would clear out any of the logs that > > were issued before that time. I'm not sure when the above happened > > verses the crash, though. It looks like it occurred as part of the > > reboot, but I'm not sure what I'm seeing. Maybe you have a startup > > process that clears the SEL? > > > > Assuming that's not the issue, what you have looks ok. I'd need to add > > some logs to the kernel to see if the log operation ever happens. > > > > -corey > > > >> > >> The SOL log looks like this: > >> > >> > >> [1107585.917689] sysrq: Trigger a crash > >> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash > >> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G I > >> 5.10.159 #1-NixOS > >> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS > >> 1.11.0 07/23/2012 > >> [1107585.943058] Call Trace: > >> [1107585.945680] dump_stack+0x6b/0x83 > >> [1107585.949158] panic+0x101/0x2c8 > >> [1107585.952379] ? printk+0x58/0x73 > >> [1107585.955687] sysrq_handle_crash+0x16/0x20 > >> [1107585.959859] __handle_sysrq.cold+0x43/0x11a > >> [1107585.964203] write_sysrq_trigger+0x24/0x40 > >> [1107585.968463] proc_reg_write+0x51/0x90 > >> [1107585.972290] vfs_write+0xc3/0x280 > >> [1107585.975768] ksys_write+0x5f/0xe0 > >> [1107585.979248] do_syscall_64+0x33/0x40 > >> [1107585.982987] entry_SYSCALL_64_after_hwframe+0x61/0xc6 > >> [1107585.988199] RIP: 0033:0x7f5873932133 > >> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f > >> 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 > >> <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5 > >> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: > >> 0000000000000001 > >> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: > >> 00007f5873932133 > >> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: > >> 0000000000000001 > >> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: > >> 00007f58739c2f40 > >> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: > >> 0000000000000002 > >> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: > >> 0000000000000002 > >> > >> Nothing obvious to me here … if you have any further ideas what to test, > >> let me know. I should be more responsive again now. > >> > >> Thanks and kind regards, > >> Christian > >> > >>> On 5. Mar 2023, at 23:53, Corey Minyard <[email protected]> wrote: > >>> > >>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote: > >>>> I’m going to actually attach a serial console to watch the “echo c” > >>>> panic, maybe that gives _some_ indication. > >>>> > >>>> Otherwise: I can quickly run patches on the kernel there to try out > >>>> things. (And the funding offer still stands.) > >>> > >>> Any news on this? I'm curious what this could be. > >>> > >>> -corey > >>> > >>>> > >>>> Christian > >>>> > >>>>> On 1. Mar 2023, at 17:58, Corey Minyard <[email protected]> wrote: > >>>>> > >>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote: > >>>>>> Thanks, both machines report: > >>>>>> > >>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op > >>>>>> string > >>>>> > >>>>> At this point, I have no idea. I'd have to start adding printks into > >>>>> the code and cause crashes to see what is happing. > >>>>> > >>>>> Maybe something is getting in the way of the panic notifiers and doing > >>>>> something to prevent the IPMI driver from working. > >>>>> > >>>>> -corey > >>>>> > >>>>>> > >>>>>> > >>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <[email protected]> wrote: > >>>>>>> > >>>>>>> Oh, I forgot. You can look at panic_op in > >>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op > >>>>>>> > >>>>>>> -corey > >>>>>>> > >>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via > >>>>>>> Openipmi-developer wrote: > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <[email protected]> wrote: > >>>>>>>>> > >>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via > >>>>>>>>> Openipmi-developer wrote: > >>>>>>>>>> Hi, > >>>>>>>>>> > >>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and am > >>>>>>>>>> running out of ideas whether this is a bug or whether something so > >>>>>>>>>> subtle has changed in my config that I’m just not seeing it. > >>>>>>>>>> > >>>>>>>>>> (Note: I’m willing to pay for consulting.) > >>>>>>>>> > >>>>>>>>> Probably not necessary. > >>>>>>>> > >>>>>>>> Thanks! The offer always stands. If we should ever meet I’m also > >>>>>>>> able to pay in beverages. ;) > >>>>>>>> > >>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo, > >>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS, > >>>>>>>>>> (mostly) vanilla kernel 5.10.159) and I’m now experiencing crashes > >>>>>>>>>> that seem to be kernel panics but do not get the usual messages in > >>>>>>>>>> the IPMI SEL. > >>>>>>>>> > >>>>>>>>> I just tested on stock 5.10.159 and it worked without issue. > >>>>>>>>> Everything > >>>>>>>>> you have below looks ok. > >>>>>>>>> > >>>>>>>>> Can you test by causing a crash with: > >>>>>>>>> > >>>>>>>>> echo c >/proc/sysrq-trigger > >>>>>>>>> > >>>>>>>>> and see if it works? > >>>>>>>> > >>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work. > >>>>>>>> > >>>>>>>>> It sounds like you are having some type of crash that you would > >>>>>>>>> normally > >>>>>>>>> use the IPMI logs to debug. However, they aren't perfect, the > >>>>>>>>> system > >>>>>>>>> has to stay up long enough to get them into the event log. > >>>>>>>> > >>>>>>>> I think they are staying up long enough because a panic triggers the > >>>>>>>> 255 second bump in the watchdog and only then pass on. However, i’ve > >>>>>>>> also noticed that the kernel _should_ be rebooting after a panic > >>>>>>>> much faster (and not rely on the watchdog) and that doesn’t happen > >>>>>>>> either. (Sorry this just popped from the back of my head). > >>>>>>>> > >>>>>>>>> In this situation, getting a serial console (probably through IPMI > >>>>>>>>> Serial over LAN) and getting the console output on a crash is > >>>>>>>>> probably > >>>>>>>>> your best option. You can use ipmitool for this, or I have a > >>>>>>>>> library > >>>>>>>>> that is able to make connections to serial ports, including through > >>>>>>>>> IPMI > >>>>>>>>> SoL. > >>>>>>>> > >>>>>>>> Yup. Been there, too. :) > >>>>>>>> > >>>>>>>> Unfortunately we’re currently chasing something that pops up very > >>>>>>>> randomly on somewhat odd machines and I also have the feeling that > >>>>>>>> it’s systematically broken right now (as the “echo c” doesn’t work). > >>>>>>>> > >>>>>>>> Thanks a lot, > >>>>>>>> Christian > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Christian Theune · [email protected] · +49 345 219401 0 > >>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io > >>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland > >>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian > >>>>>>>> Zagrodnick > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Openipmi-developer mailing list > >>>>>>>> [email protected] > >>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer > >>>>>> > >>>>>> Liebe Grüße, > >>>>>> Christian Theune > >>>>>> > >>>>>> -- > >>>>>> Christian Theune · [email protected] · +49 345 219401 0 > >>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io > >>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland > >>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian > >>>>>> Zagrodnick > >>>>>> > >>>> > >>>> Liebe Grüße, > >>>> Christian Theune > >>>> > >>>> -- > >>>> Christian Theune · [email protected] · +49 345 219401 0 > >>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io > >>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland > >>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian > >>>> Zagrodnick > >>>> > >> > >> Liebe Grüße, > >> Christian Theune > >> > >> -- > >> Christian Theune · [email protected] · +49 345 219401 0 > >> Flying Circus Internet Operations GmbH · https://flyingcircus.io > >> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland > >> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian > >> Zagrodnick > > > Liebe Grüße, > Christian Theune > > -- > Christian Theune · [email protected] · +49 345 219401 0 > Flying Circus Internet Operations GmbH · https://flyingcircus.io > Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland > HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick > _______________________________________________ Openipmi-developer mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/openipmi-developer
