Hi,

yeah, the IPMI log is fine. A job in our system runs every 10 minutes, exports the log, and then clears it. The job looks like this:

    /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel elist
    /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel clear

So it’s not atomic, but it runs after boot and the elist should output the entries properly … at least it did in the past. ;)

As I said, I’m happy to run any patches you have. If you point me to a git branch somewhere, I can switch that system over easily.

Cheers,
Christian

> On 13. Mar 2023, at 13:58, Corey Minyard <[email protected]> wrote:
>
> On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
>> Hi,
>>
>> alright, so here’s the output from the NixOS machine:
>>
>> root@xxx ~ # echo c >/proc/sysrq-trigger
>> client_loop: send disconnect: Broken pipe
>> …
>>
>> root@xxx ~ # journalctl -u ipmi-log.service
>> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
>> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL. Please allow a few seconds to erase.
>> ...
>> -- Boot fdef496e784e4541abd9ae40df472a0b --
>> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
>> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
>> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. Please allow a few seconds to erase.
>
> Hmm, the SEL got cleared. That would clear out any of the logs that
> were issued before that time. I'm not sure when the above happened
> versus the crash, though. It looks like it occurred as part of the
> reboot, but I'm not sure what I'm seeing. Maybe you have a startup
> process that clears the SEL?
>
> Assuming that's not the issue, what you have looks ok. I'd need to add
> some logs to the kernel to see if the log operation ever happens.
>
> -corey
>
>>
>> The SOL log looks like this:
>>
>>
>> [1107585.917689] sysrq: Trigger a crash
>> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
>> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G I 5.10.159 #1-NixOS
>> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
>> [1107585.943058] Call Trace:
>> [1107585.945680] dump_stack+0x6b/0x83
>> [1107585.949158] panic+0x101/0x2c8
>> [1107585.952379] ? printk+0x58/0x73
>> [1107585.955687] sysrq_handle_crash+0x16/0x20
>> [1107585.959859] __handle_sysrq.cold+0x43/0x11a
>> [1107585.964203] write_sysrq_trigger+0x24/0x40
>> [1107585.968463] proc_reg_write+0x51/0x90
>> [1107585.972290] vfs_write+0xc3/0x280
>> [1107585.975768] ksys_write+0x5f/0xe0
>> [1107585.979248] do_syscall_64+0x33/0x40
>> [1107585.982987] entry_SYSCALL_64_after_hwframe+0x61/0xc6
>> [1107585.988199] RIP: 0033:0x7f5873932133
>> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
>> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
>> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
>> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
>> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
>> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002
>>
>> Nothing obvious to me here … if you have any further ideas what to test, let
>> me know. I should be more responsive again now.
>>
>> Thanks and kind regards,
>> Christian
>>
>>> On 5. Mar 2023, at 23:53, Corey Minyard <[email protected]> wrote:
>>>
>>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
>>>> I’m going to actually attach a serial console to watch the “echo c” panic,
>>>> maybe that gives _some_ indication.
>>>>
>>>> Otherwise: I can quickly run patches on the kernel there to try out
>>>> things. (And the funding offer still stands.)
>>>
>>> Any news on this? I'm curious what this could be.
>>>
>>> -corey
>>>
>>>>
>>>> Christian
>>>>
>>>>> On 1. Mar 2023, at 17:58, Corey Minyard <[email protected]> wrote:
>>>>>
>>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
>>>>>> Thanks, both machines report:
>>>>>>
>>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>> string
>>>>>
>>>>> At this point, I have no idea. I'd have to start adding printks into
>>>>> the code and cause crashes to see what is happening.
>>>>>
>>>>> Maybe something is getting in the way of the panic notifiers and doing
>>>>> something to prevent the IPMI driver from working.
>>>>>
>>>>> -corey
>>>>>
>>>>>>
>>>>>>
>>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <[email protected]> wrote:
>>>>>>>
>>>>>>> Oh, I forgot. You can look at panic_op in
>>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>
>>>>>>> -corey
>>>>>>>
>>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via
>>>>>>> Openipmi-developer wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via
>>>>>>>>> Openipmi-developer wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and am
>>>>>>>>>> running out of ideas whether this is a bug or whether something so
>>>>>>>>>> subtle has changed in my config that I’m just not seeing it.
>>>>>>>>>>
>>>>>>>>>> (Note: I’m willing to pay for consulting.)
>>>>>>>>>
>>>>>>>>> Probably not necessary.
>>>>>>>>
>>>>>>>> Thanks! The offer always stands. If we should ever meet I’m also able
>>>>>>>> to pay in beverages. ;)
>>>>>>>>
>>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo,
>>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS, (mostly)
>>>>>>>>>> vanilla kernel 5.10.159) and I’m now experiencing crashes that seem
>>>>>>>>>> to be kernel panics but do not get the usual messages in the IPMI
>>>>>>>>>> SEL.
>>>>>>>>>
>>>>>>>>> I just tested on stock 5.10.159 and it worked without issue.
>>>>>>>>> Everything you have below looks ok.
>>>>>>>>>
>>>>>>>>> Can you test by causing a crash with:
>>>>>>>>>
>>>>>>>>> echo c >/proc/sysrq-trigger
>>>>>>>>>
>>>>>>>>> and see if it works?
>>>>>>>>
>>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
>>>>>>>>
>>>>>>>>> It sounds like you are having some type of crash that you would
>>>>>>>>> normally use the IPMI logs to debug. However, they aren't perfect,
>>>>>>>>> the system has to stay up long enough to get them into the event log.
>>>>>>>>
>>>>>>>> I think they are staying up long enough because a panic triggers the
>>>>>>>> 255 second bump in the watchdog and only then pass on. However, I’ve
>>>>>>>> also noticed that the kernel _should_ be rebooting after a panic much
>>>>>>>> faster (and not rely on the watchdog) and that doesn’t happen either.
>>>>>>>> (Sorry, this just popped up from the back of my head.)
>>>>>>>>
>>>>>>>>> In this situation, getting a serial console (probably through IPMI
>>>>>>>>> Serial over LAN) and getting the console output on a crash is probably
>>>>>>>>> your best option. You can use ipmitool for this, or I have a library
>>>>>>>>> that is able to make connections to serial ports, including through
>>>>>>>>> IPMI SoL.
>>>>>>>>
>>>>>>>> Yup. Been there, too. :)
>>>>>>>>
>>>>>>>> Unfortunately we’re currently chasing something that pops up very
>>>>>>>> randomly on somewhat odd machines and I also have the feeling that
>>>>>>>> it’s systematically broken right now (as the “echo c” doesn’t work).
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> Christian
>>>>>>>>
>>>>>>>> --
>>>>>>>> Christian Theune · [email protected] · +49 345 219401 0
>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Germany
>>>>>>>> HR Stendal HRB 21169 · Managing Directors: Christian Theune, Christian
>>>>>>>> Zagrodnick
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Openipmi-developer mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
>>>>>>
>>>>>> Kind regards,
>>>>>> Christian Theune
>>>>>>
>>>>
>>>> Kind regards,
>>>> Christian Theune
>>>>
>>
>> Kind regards,
>> Christian Theune

Kind regards,
Christian Theune
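
The export job described at the top of this thread runs `ipmitool sel elist` followed by `ipmitool sel clear` as two independent steps. As a minimal sketch only (not the job actually deployed on these machines), the clear step could be skipped whenever the export fails, so that entries are not discarded unread. The names `export_then_clear`, `run_command`, `runner` and `sink` are hypothetical, `ipmitool` is assumed to be on PATH instead of the /nix/store path, and it is an assumption that a failed export shows up as a non-zero exit code.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# Assumption: ipmitool is resolved via PATH; the thread uses a /nix/store path.
IPMITOOL = "ipmitool"

Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, stdout)."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
    except OSError:
        # Binary missing or not executable: report it as a failure.
        return 127, ""
    return proc.returncode, proc.stdout


def export_then_clear(runner: Runner = run_command,
                      sink: Callable[[str], None] = print) -> bool:
    """Export the SEL and clear it only if the export succeeded."""
    rc, out = runner([IPMITOOL, "sel", "elist"])
    if rc != 0:
        # Leave the SEL untouched so the next run can still read it.
        return False
    for line in out.splitlines():
        sink(line)
    rc, _ = runner([IPMITOOL, "sel", "clear"])
    return rc == 0
```

This still is not atomic: an event written between the export and the clear is lost either way. It only avoids clearing a log that was never exported.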

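
The `panic_op` check suggested in the thread (`cat /sys/module/ipmi_msghandler/parameters/panic_op`) can also be scripted, for example to compare several machines. The following is a rough sketch under assumptions: the function names are hypothetical, and the value descriptions paraphrase the module parameter as I understand it (`none`, `event`, `string`), so they should be checked against the running kernel's documentation.

```python
from pathlib import Path
from typing import Optional

# Path taken from the thread.
PANIC_OP_PATH = Path("/sys/module/ipmi_msghandler/parameters/panic_op")

# Assumption: meanings paraphrased from the ipmi_msghandler parameter description.
PANIC_OP_MEANING = {
    "none": "no panic information is stored in the SEL",
    "event": "a single panic event is stored in the SEL",
    "string": "a panic event plus the panic string (as OEM events) is stored",
}


def read_panic_op(path: Path = PANIC_OP_PATH) -> Optional[str]:
    """Return the current panic_op value, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        # Typically means ipmi_msghandler is not loaded on this machine.
        return None


def describe_panic_op(value: Optional[str]) -> str:
    """Turn a panic_op value into a short human-readable description."""
    if value is None:
        return "panic_op not readable (is ipmi_msghandler loaded?)"
    return PANIC_OP_MEANING.get(value, f"unrecognized value: {value!r}")
```

On the machines in this thread, `describe_panic_op(read_panic_op())` would report the `string` case, which matches the expectation that panics should end up in the SEL.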