On Mon, Mar 13, 2023 at 02:07:01PM +0100, Christian Theune wrote:
> Hi,
> 
> yeah, the IPMI log is fine. This is a 10 minute interval job in our system 
> that exports the log and clears it:
> 
> The job looks like this:
> 
> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel 
> elist
> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel 
> clear
> 
> So it’s not atomic but it runs after the boot and the elist should output it 
> properly … at least it did in the past. ;)
> 
> As I said - I’m happy to run any patches you have. If you point me to a git 
> branch somewhere I can switch that system easily.

Ok, I have a branch at

https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events

that has debug tracing.  It will print the function for all panic event
handlers, their return values, and adds traces in the IPMI panic event
handlers.

It's a single patch right on top of 6.2; I'm not sure how portable it is
to other kernel versions.  I can port if you like.

Thanks,

-corey

>  
> Cheers,
> Christian
> 
> > On 13. Mar 2023, at 13:58, Corey Minyard <[email protected]> wrote:
> > 
> > On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
> >> Hi,
> >> 
> >> alright, so here’s the output from the NixOS machine:
> >> 
> >> root@xxx ~ # echo c >/proc/sysrq-trigger
> >> client_loop: send disconnect: Broken pipe
> >> …
> >> 
> >> root@xxx ~ # journalctl -u ipmi-log.service
> >> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 
> >> 10:25:27 CET. --
> >> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL.  Please allow a 
> >> few seconds to erase.
> >> ...
> >> -- Boot fdef496e784e4541abd9ae40df472a0b --
> >> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    1 | 03/13/2023 | 09:12:49 | 
> >> Event Logging Disabled SEL | Log area reset/cleared | Asserted
> >> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    2 | 03/13/2023 | 09:21:06 | 
> >> Watchdog2 OS Watchdog | Hard reset | Asserted
> >> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL.  Please allow a 
> >> few seconds to erase.
> > 
> > Hmm, the SEL got cleared.  That would clear out any of the logs that
> > were issued before that time.  I'm not sure when the above happened
> > verses the crash, though.  It looks like it occurred as part of the
> > reboot, but I'm not sure what I'm seeing.  Maybe you have a startup
> > process that clears the SEL?
> > 
> > Assuming that's not the issue, what you have looks ok.  I'd need to add
> > some logs to the kernel to see if the log operation ever happens.
> > 
> > -corey
> > 
> >> 
> >> The SOL log looks like this:
> >> 
> >> 
> >> [1107585.917689] sysrq: Trigger a crash
> >> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
> >> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G          I       
> >> 5.10.159 #1-NixOS
> >> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 
> >> 1.11.0 07/23/2012
> >> [1107585.943058] Call Trace:
> >> [1107585.945680]  dump_stack+0x6b/0x83
> >> [1107585.949158]  panic+0x101/0x2c8
> >> [1107585.952379]  ? printk+0x58/0x73
> >> [1107585.955687]  sysrq_handle_crash+0x16/0x20
> >> [1107585.959859]  __handle_sysrq.cold+0x43/0x11a
> >> [1107585.964203]  write_sysrq_trigger+0x24/0x40
> >> [1107585.968463]  proc_reg_write+0x51/0x90
> >> [1107585.972290]  vfs_write+0xc3/0x280
> >> [1107585.975768]  ksys_write+0x5f/0xe0
> >> [1107585.979248]  do_syscall_64+0x33/0x40
> >> [1107585.982987]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
> >> [1107585.988199] RIP: 0033:0x7f5873932133
> >> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 
> >> 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 
> >> <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
> >> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 
> >> 0000000000000001
> >> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 
> >> 00007f5873932133
> >> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 
> >> 0000000000000001
> >> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 
> >> 00007f58739c2f40
> >> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 
> >> 0000000000000002
> >> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 
> >> 0000000000000002
> >> 
> >> Nothing obvious to me here … if you have any further ideas what to test, 
> >> let me know. I should be more responsive again now.
> >> 
> >> Thanks and kind regards,
> >> Christian
> >> 
> >>> On 5. Mar 2023, at 23:53, Corey Minyard <[email protected]> wrote:
> >>> 
> >>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
> >>>> I’m going to actually attach a serial console to watch the “echo c” 
> >>>> panic, maybe that gives _some_ indication.
> >>>> 
> >>>> Otherwise: I can quickly run patches on the kernel there to try out 
> >>>> things. (And the funding offer still stands.)
> >>> 
> >>> Any news on this?  I'm curious what this could be.
> >>> 
> >>> -corey
> >>> 
> >>>> 
> >>>> Christian
> >>>> 
> >>>>> On 1. Mar 2023, at 17:58, Corey Minyard <[email protected]> wrote:
> >>>>> 
> >>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
> >>>>>> Thanks, both machines report:
> >>>>>> 
> >>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
> >>>>>> string
> >>>>> 
> >>>>> At this point, I have no idea.  I'd have to start adding printks into
> >>>>> the code and cause crashes to see what is happing.
> >>>>> 
> >>>>> Maybe something is getting in the way of the panic notifiers and doing
> >>>>> something to prevent the IPMI driver from working.
> >>>>> 
> >>>>> -corey
> >>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <[email protected]> wrote:
> >>>>>>> 
> >>>>>>> Oh, I forgot.  You can look at panic_op in 
> >>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
> >>>>>>> 
> >>>>>>> -corey
> >>>>>>> 
> >>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via 
> >>>>>>> Openipmi-developer wrote:
> >>>>>>>> Hi,
> >>>>>>>> 
> >>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <[email protected]> wrote:
> >>>>>>>>> 
> >>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via 
> >>>>>>>>> Openipmi-developer wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>> 
> >>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and am 
> >>>>>>>>>> running out of ideas whether this is a bug or whether something so 
> >>>>>>>>>> subtle has changed in my config that I’m just not seeing it.
> >>>>>>>>>> 
> >>>>>>>>>> (Note: I’m willing to pay for consulting.)
> >>>>>>>>> 
> >>>>>>>>> Probably not necessary.
> >>>>>>>> 
> >>>>>>>> Thanks! The offer always stands. If we should ever meet I’m also 
> >>>>>>>> able to pay in beverages. ;)
> >>>>>>>> 
> >>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo, 
> >>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS, 
> >>>>>>>>>> (mostly) vanilla kernel 5.10.159) and I’m now experiencing crashes 
> >>>>>>>>>> that seem to be kernel panics but do not get the usual messages in 
> >>>>>>>>>> the IPMI SEL.
> >>>>>>>>> 
> >>>>>>>>> I just tested on stock 5.10.159 and it worked without issue.  
> >>>>>>>>> Everything
> >>>>>>>>> you have below looks ok.
> >>>>>>>>> 
> >>>>>>>>> Can you test by causing a crash with:
> >>>>>>>>> 
> >>>>>>>>> echo c >/proc/sysrq-trigger
> >>>>>>>>> 
> >>>>>>>>> and see if it works?
> >>>>>>>> 
> >>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
> >>>>>>>> 
> >>>>>>>>> It sounds like you are having some type of crash that you would 
> >>>>>>>>> normally
> >>>>>>>>> use the IPMI logs to debug.  However, they aren't perfect, the 
> >>>>>>>>> system
> >>>>>>>>> has to stay up long enough to get them into the event log.
> >>>>>>>> 
> >>>>>>>> I think they are staying up long enough because a panic triggers the 
> >>>>>>>> 255 second bump in the watchdog and only then pass on. However, i’ve 
> >>>>>>>> also noticed that the kernel _should_ be rebooting after a panic 
> >>>>>>>> much faster (and not rely on the watchdog) and that doesn’t happen 
> >>>>>>>> either. (Sorry this just popped from the back of my head).
> >>>>>>>> 
> >>>>>>>>> In this situation, getting a serial console (probably through IPMI
> >>>>>>>>> Serial over LAN) and getting the console output on a crash is 
> >>>>>>>>> probably
> >>>>>>>>> your best option.  You can use ipmitool for this, or I have a 
> >>>>>>>>> library
> >>>>>>>>> that is able to make connections to serial ports, including through 
> >>>>>>>>> IPMI
> >>>>>>>>> SoL.
> >>>>>>>> 
> >>>>>>>> Yup. Been there, too. :)
> >>>>>>>> 
> >>>>>>>> Unfortunately we’re currently chasing something that pops up very 
> >>>>>>>> randomly on somewhat odd machines and I also have the feeling that 
> >>>>>>>> it’s systematically broken right now (as the “echo c” doesn’t work).
> >>>>>>>> 
> >>>>>>>> Thanks a lot,
> >>>>>>>> Christian
> >>>>>>>> 
> >>>>>>>> -- 
> >>>>>>>> Christian Theune · [email protected] · +49 345 219401 0
> >>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> >>>>>>>> Zagrodnick
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> _______________________________________________
> >>>>>>>> Openipmi-developer mailing list
> >>>>>>>> [email protected]
> >>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
> >>>>>> 
> >>>>>> Liebe Grüße,
> >>>>>> Christian Theune
> >>>>>> 
> >>>>>> -- 
> >>>>>> Christian Theune · [email protected] · +49 345 219401 0
> >>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> >>>>>> Zagrodnick
> >>>>>> 
> >>>> 
> >>>> Liebe Grüße,
> >>>> Christian Theune
> >>>> 
> >>>> -- 
> >>>> Christian Theune · [email protected] · +49 345 219401 0
> >>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> >>>> Zagrodnick
> >>>> 
> >> 
> >> Liebe Grüße,
> >> Christian Theune
> >> 
> >> -- 
> >> Christian Theune · [email protected] · +49 345 219401 0
> >> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> >> Zagrodnick
> 
> 
> Liebe Grüße,
> Christian Theune
> 
> -- 
> Christian Theune · [email protected] · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
> 


_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to