Hi, sorry, I didn’t expect you to make me a branch. I had already taken your diff over to 5.10 as it applied cleanly … sorry for the additional work and thanks anyways.
Here’s the output:

[ 6521.905890] sysrq: Trigger a crash
[ 6521.909294] Kernel panic - not syncing: sysrq triggered crash
[ 6521.915026] CPU: 1 PID: 43785 Comm: bash Tainted: G I 5.10.159 #1-NixOS
[ 6521.922925] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
[ 6521.930475] Call Trace:
[ 6521.932923] dump_stack+0x6b/0x83
[ 6521.936230] panic+0x101/0x2c8
[ 6521.939276] ? printk+0x58/0x73
[ 6521.942408] sysrq_handle_crash+0x16/0x20
[ 6521.946407] __handle_sysrq.cold+0x43/0x11a
[ 6521.950580] write_sysrq_trigger+0x24/0x40
[ 6521.954668] proc_reg_write+0x51/0x90
[ 6521.958322] vfs_write+0xc3/0x280
[ 6521.961627] ksys_write+0x5f/0xe0
[ 6521.964935] do_syscall_64+0x33/0x40
[ 6521.968502] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 6521.973540] RIP: 0033:0x7f2c6b91a133
[ 6521.977106] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
[ 6521.995836] RSP: 002b:00007ffc4cf11088 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 6522.003387] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f2c6b91a133
[ 6522.010505] RDX: 0000000000000002 RSI: 0000000001555c08 RDI: 0000000000000001
[ 6522.017623] RBP: 0000000001555c08 R08: 000000000000000a R09: 00007f2c6b9aaf40
[ 6522.024743] R10: 00000000016e4218 R11: 0000000000000246 R12: 0000000000000002
[ 6522.031864] R13: 00007f2c6b9e8520 R14: 00007f2c6b9e8720 R15: 0000000000000002
[ 6522.039085] Calling notifier panic_event+0x0/0x410 [ipmi_msghandler] (000000008eb8cb44)
[ 6522.047071] IPMI message handler: IPMI: panic event handler
[ 6522.052628] IPMI message handler: IPMI: handling panic event for intf 0: 00000000443777b3 0000000067d05ff8

… and then it reboots after the 255 seconds from the watchdog timer have passed.

Christian

> On 13.
Mar 2023, at 18:13, Corey Minyard <miny...@acm.org> wrote:
>
> On Mon, Mar 13, 2023 at 05:42:39PM +0100, Christian Theune wrote:
>> Hrghs. I’m applying your patch to 5.10 as my distro build infrastructure has
>> some patches that don’t apply to 6.2 and that I don’t know how to circumvent
>> quickly enough… :)
>
> Ok, there's a
>
> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events-5.10
>
> branch available for you to pull. It's on top of latest 5.10.
>
> -corey
>
>>
>>> On 13. Mar 2023, at 16:59, Christian Theune <c...@flyingcircus.io> wrote:
>>>
>>> I should be easily able to run 6.2, no worries.
>>>
>>>
>>>> On 13. Mar 2023, at 16:33, Corey Minyard <miny...@acm.org> wrote:
>>>>
>>>> On Mon, Mar 13, 2023 at 02:07:01PM +0100, Christian Theune wrote:
>>>>> Hi,
>>>>>
>>>>> yeah, the IPMI log is fine. This is a 10 minute interval job in our
>>>>> system that exports the log and clears it:
>>>>>
>>>>> The job looks like this:
>>>>>
>>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel elist
>>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel clear
>>>>>
>>>>> So it’s not atomic but it runs after the boot and the elist should output
>>>>> it properly … at least it did in the past. ;)
>>>>>
>>>>> As I said - I’m happy to run any patches you have. If you point me to a
>>>>> git branch somewhere I can switch that system easily.
>>>>
>>>> Ok, I have a branch at
>>>>
>>>> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events
>>>>
>>>> that has debug tracing. It will print the function for all panic event
>>>> handlers, their return values, and adds traces in the IPMI panic event
>>>> handlers.
>>>>
>>>> It's a single patch right on top of 6.2; I'm not sure how portable it is
>>>> to other kernel versions. I can port if you like.
>>>>
>>>> Thanks,
>>>>
>>>> -corey
>>>>
>>>>>
>>>>> Cheers,
>>>>> Christian
>>>>>
>>>>>>> On 13.
Mar 2023, at 13:58, Corey Minyard <miny...@acm.org> wrote:
>>>>>>
>>>>>> On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> alright, so here’s the output from the NixOS machine:
>>>>>>>
>>>>>>> root@xxx ~ # echo c >/proc/sysrq-trigger
>>>>>>> client_loop: send disconnect: Broken pipe
>>>>>>> …
>>>>>>>
>>>>>>> root@xxx ~ # journalctl -u ipmi-log.service
>>>>>>> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
>>>>>>> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL. Please allow a few seconds to erase.
>>>>>>> ...
>>>>>>> -- Boot fdef496e784e4541abd9ae40df472a0b --
>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. Please allow a few seconds to erase.
>>>>>>
>>>>>> Hmm, the SEL got cleared. That would clear out any of the logs that
>>>>>> were issued before that time. I'm not sure when the above happened
>>>>>> versus the crash, though. It looks like it occurred as part of the
>>>>>> reboot, but I'm not sure what I'm seeing. Maybe you have a startup
>>>>>> process that clears the SEL?
>>>>>>
>>>>>> Assuming that's not the issue, what you have looks ok. I'd need to add
>>>>>> some logs to the kernel to see if the log operation ever happens.
>>>>>>
>>>>>> -corey
>>>>>>
>>>>>>>
>>>>>>> The SOL log looks like this:
>>>>>>>
>>>>>>>
>>>>>>> [1107585.917689] sysrq: Trigger a crash
>>>>>>> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
>>>>>>> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G I 5.10.159 #1-NixOS
>>>>>>> [1107585.935335] Hardware name: Dell Inc.
PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
>>>>>>> [1107585.943058] Call Trace:
>>>>>>> [1107585.945680] dump_stack+0x6b/0x83
>>>>>>> [1107585.949158] panic+0x101/0x2c8
>>>>>>> [1107585.952379] ? printk+0x58/0x73
>>>>>>> [1107585.955687] sysrq_handle_crash+0x16/0x20
>>>>>>> [1107585.959859] __handle_sysrq.cold+0x43/0x11a
>>>>>>> [1107585.964203] write_sysrq_trigger+0x24/0x40
>>>>>>> [1107585.968463] proc_reg_write+0x51/0x90
>>>>>>> [1107585.972290] vfs_write+0xc3/0x280
>>>>>>> [1107585.975768] ksys_write+0x5f/0xe0
>>>>>>> [1107585.979248] do_syscall_64+0x33/0x40
>>>>>>> [1107585.982987] entry_SYSCALL_64_after_hwframe+0x61/0xc6
>>>>>>> [1107585.988199] RIP: 0033:0x7f5873932133
>>>>>>> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
>>>>>>> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>>>>>> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
>>>>>>> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
>>>>>>> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
>>>>>>> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
>>>>>>> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002
>>>>>>>
>>>>>>> Nothing obvious to me here … if you have any further ideas what to
>>>>>>> test, let me know. I should be more responsive again now.
>>>>>>>
>>>>>>> Thanks and kind regards,
>>>>>>> Christian
>>>>>>>
>>>>>>>> On 5.
Mar 2023, at 23:53, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>
>>>>>>>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
>>>>>>>>> I’m going to actually attach a serial console to watch the “echo c”
>>>>>>>>> panic, maybe that gives _some_ indication.
>>>>>>>>>
>>>>>>>>> Otherwise: I can quickly run patches on the kernel there to try out
>>>>>>>>> things. (And the funding offer still stands.)
>>>>>>>>
>>>>>>>> Any news on this? I'm curious what this could be.
>>>>>>>>
>>>>>>>> -corey
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Christian
>>>>>>>>>
>>>>>>>>>> On 1. Mar 2023, at 17:58, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
>>>>>>>>>>> Thanks, both machines report:
>>>>>>>>>>>
>>>>>>>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>>>>> string
>>>>>>>>>>
>>>>>>>>>> At this point, I have no idea. I'd have to start adding printks into
>>>>>>>>>> the code and cause crashes to see what is happening.
>>>>>>>>>>
>>>>>>>>>> Maybe something is getting in the way of the panic notifiers and doing
>>>>>>>>>> something to prevent the IPMI driver from working.
>>>>>>>>>>
>>>>>>>>>> -corey
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Oh, I forgot. You can look at panic_op in
>>>>>>>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>>>>>>
>>>>>>>>>>>> -corey
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via Openipmi-developer wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 28.
Feb 2023, at 17:36, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via Openipmi-developer wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and
>>>>>>>>>>>>>>> am running out of ideas whether this is a bug or whether
>>>>>>>>>>>>>>> something so subtle has changed in my config that I’m just not
>>>>>>>>>>>>>>> seeing it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (Note: I’m willing to pay for consulting.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Probably not necessary.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks! The offer always stands. If we should ever meet I’m also
>>>>>>>>>>>>> able to pay in beverages. ;)
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo,
>>>>>>>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS,
>>>>>>>>>>>>>>> (mostly) vanilla kernel 5.10.159) and I’m now experiencing
>>>>>>>>>>>>>>> crashes that seem to be kernel panics but do not get the usual
>>>>>>>>>>>>>>> messages in the IPMI SEL.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just tested on stock 5.10.159 and it worked without issue. Everything
>>>>>>>>>>>>>> you have below looks ok.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you test by causing a crash with:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> echo c >/proc/sysrq-trigger
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and see if it works?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> It sounds like you are having some type of crash that you would normally
>>>>>>>>>>>>>> use the IPMI logs to debug. However, they aren't perfect: the system
>>>>>>>>>>>>>> has to stay up long enough to get them into the event log.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think they are staying up long enough because a panic triggers
>>>>>>>>>>>>> the 255 second bump in the watchdog and only then pass on.
>>>>>>>>>>>>> However, I’ve also noticed that the kernel _should_ be rebooting
>>>>>>>>>>>>> after a panic much faster (and not rely on the watchdog) and that
>>>>>>>>>>>>> doesn’t happen either. (Sorry, this just popped from the back of
>>>>>>>>>>>>> my head.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In this situation, getting a serial console (probably through IPMI
>>>>>>>>>>>>>> Serial over LAN) and getting the console output on a crash is probably
>>>>>>>>>>>>>> your best option. You can use ipmitool for this, or I have a library
>>>>>>>>>>>>>> that is able to make connections to serial ports, including through IPMI
>>>>>>>>>>>>>> SoL.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yup. Been there, too. :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately we’re currently chasing something that pops up very
>>>>>>>>>>>>> randomly on somewhat odd machines and I also have the feeling
>>>>>>>>>>>>> that it’s systematically broken right now (as the “echo c”
>>>>>>>>>>>>> doesn’t work).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>> Christian
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Openipmi-developer mailing list
>>>>>>>>>>>>> Openipmi-developer@lists.sourceforge.net
>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> Christian Theune
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>>>>> Leipziger Str.
70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>>>>>>>>>>>

Kind regards,
Christian Theune

--
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str.
70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
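
A note on the SEL export job quoted in the thread (ipmitool sel elist followed by ipmitool sel clear): since the question came up whether a clear could have wiped entries before they were read, here is a minimal sketch of a more defensive variant. It is not the actual job from the thread. The function name export_and_clear_sel, the IPMITOOL variable, and the archive path argument are assumptions made up for illustration; only the two ipmitool subcommands are taken from the thread.

```shell
#!/bin/sh
# Sketch only: export the SEL, and clear it only when the export succeeded.
# IPMITOOL and the archive path are illustrative assumptions, not the real job.
IPMITOOL="${IPMITOOL:-ipmitool}"

export_and_clear_sel() {
    archive="$1"
    tmp="$archive.tmp"

    # Export to a temporary file first, so a failed export leaves no partial archive.
    if ! "$IPMITOOL" sel elist > "$tmp"; then
        rm -f "$tmp"
        echo "SEL export failed, leaving SEL untouched" >&2
        return 1
    fi

    mv "$tmp" "$archive" || return 1

    # Entries logged between the export above and this clear are still lost.
    "$IPMITOOL" sel clear
}
```

It could be called as export_and_clear_sel /some/path/sel.log (the path is a placeholder). This only guards against clearing after a failed export; it does not make the export and clear atomic, so an event written between the two commands can still disappear.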