On Tue, Mar 14, 2023 at 03:22:39PM +0100, Christian Theune via Openipmi-developer wrote:
> Hi,
>
> sorry, I didn’t expect you to make me a branch. I had already taken your diff
> over to 5.10 as it applied cleanly … sorry for the additional work and thanks
> anyways.

Ok, that's great. It's something about the IPMI watchdog panic routines, and I can reproduce. I should be able to fix this pretty quickly. I'll send a patch when I get this fixed.

Thanks,

-corey

>
> Here’s the output:
>
> [ 6521.905890] sysrq: Trigger a crash
> [ 6521.909294] Kernel panic - not syncing: sysrq triggered crash
> [ 6521.915026] CPU: 1 PID: 43785 Comm: bash Tainted: G I 5.10.159 #1-NixOS
> [ 6521.922925] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
> [ 6521.930475] Call Trace:
> [ 6521.932923] dump_stack+0x6b/0x83
> [ 6521.936230] panic+0x101/0x2c8
> [ 6521.939276] ? printk+0x58/0x73
> [ 6521.942408] sysrq_handle_crash+0x16/0x20
> [ 6521.946407] __handle_sysrq.cold+0x43/0x11a
> [ 6521.950580] write_sysrq_trigger+0x24/0x40
> [ 6521.954668] proc_reg_write+0x51/0x90
> [ 6521.958322] vfs_write+0xc3/0x280
> [ 6521.961627] ksys_write+0x5f/0xe0
> [ 6521.964935] do_syscall_64+0x33/0x40
> [ 6521.968502] entry_SYSCALL_64_after_hwframe+0x61/0xc6
> [ 6521.973540] RIP: 0033:0x7f2c6b91a133
> [ 6521.977106] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
> [ 6521.995836] RSP: 002b:00007ffc4cf11088 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 6522.003387] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f2c6b91a133
> [ 6522.010505] RDX: 0000000000000002 RSI: 0000000001555c08 RDI: 0000000000000001
> [ 6522.017623] RBP: 0000000001555c08 R08: 000000000000000a R09: 00007f2c6b9aaf40
> [ 6522.024743] R10: 00000000016e4218 R11: 0000000000000246 R12: 0000000000000002
> [ 6522.031864] R13: 00007f2c6b9e8520 R14: 00007f2c6b9e8720 R15: 0000000000000002
> [ 6522.039085] Calling notifier panic_event+0x0/0x410 [ipmi_msghandler] (000000008eb8cb44)
> [ 6522.047071] IPMI message handler: IPMI: panic event handler
> [ 6522.052628] IPMI message handler: IPMI: handling panic event for intf 0: 00000000443777b3 0000000067d05ff8
> …
> and then it reboots after the 255 seconds from the watchdog timer are passed.
>
> Christian
>
> > On 13. Mar 2023, at 18:13, Corey Minyard <miny...@acm.org> wrote:
> >
> > On Mon, Mar 13, 2023 at 05:42:39PM +0100, Christian Theune wrote:
> >> Hrghs. I’m applying your patch to 5.10 as my distro build infrastructure
> >> has some patches that don’t apply to 6.2 and that I don’t know how to
> >> circumvent quickly enough… :)
> >
> > Ok, there's a
> >
> > https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events-5.10
> >
> > branch available for you to pull. It's on top of latest 5.10.
> >
> > -corey
> >
> >>
> >>> On 13. Mar 2023, at 16:59, Christian Theune <c...@flyingcircus.io> wrote:
> >>>
> >>> I should be easily able to run 6.2, no worries.
> >>>
> >>>
> >>>> On 13. Mar 2023, at 16:33, Corey Minyard <miny...@acm.org> wrote:
> >>>>
> >>>> On Mon, Mar 13, 2023 at 02:07:01PM +0100, Christian Theune wrote:
> >>>>> Hi,
> >>>>>
> >>>>> yeah, the IPMI log is fine. This is a 10 minute interval job in our
> >>>>> system that exports the log and clears it:
> >>>>>
> >>>>> The job looks like this:
> >>>>>
> >>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel elist
> >>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel clear
> >>>>>
> >>>>> So it’s not atomic but it runs after the boot and the elist should
> >>>>> output it properly … at least it did in the past. ;)
> >>>>>
> >>>>> As I said - I’m happy to run any patches you have. If you point me to a
> >>>>> git branch somewhere I can switch that system easily.
> >>>>
> >>>> Ok, I have a branch at
> >>>>
> >>>> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events
> >>>>
> >>>> that has debug tracing. It will print the function for all panic event
> >>>> handlers, their return values, and adds traces in the IPMI panic event
> >>>> handlers.
> >>>>
> >>>> It's a single patch right on top of 6.2; I'm not sure how portable it is
> >>>> to other kernel versions. I can port if you like.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> -corey
> >>>>
> >>>>>
> >>>>> Cheers,
> >>>>> Christian
> >>>>>
> >>>>>>> On 13. Mar 2023, at 13:58, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>
> >>>>>> On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> alright, so here’s the output from the NixOS machine:
> >>>>>>>
> >>>>>>> root@xxx ~ # echo c >/proc/sysrq-trigger
> >>>>>>> client_loop: send disconnect: Broken pipe
> >>>>>>> …
> >>>>>>>
> >>>>>>> root@xxx ~ # journalctl -u ipmi-log.service
> >>>>>>> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
> >>>>>>> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL. Please allow a few seconds to erase.
> >>>>>>> ...
> >>>>>>> -- Boot fdef496e784e4541abd9ae40df472a0b --
> >>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
> >>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
> >>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. Please allow a few seconds to erase.
> >>>>>>
> >>>>>> Hmm, the SEL got cleared. That would clear out any of the logs that
> >>>>>> were issued before that time. I'm not sure when the above happened
> >>>>>> versus the crash, though. It looks like it occurred as part of the
> >>>>>> reboot, but I'm not sure what I'm seeing. Maybe you have a startup
> >>>>>> process that clears the SEL?
> >>>>>>
> >>>>>> Assuming that's not the issue, what you have looks ok. I'd need to add
> >>>>>> some logs to the kernel to see if the log operation ever happens.
> >>>>>>
> >>>>>> -corey
> >>>>>>
> >>>>>>>
> >>>>>>> The SOL log looks like this:
> >>>>>>>
> >>>>>>>
> >>>>>>> [1107585.917689] sysrq: Trigger a crash
> >>>>>>> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
> >>>>>>> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G I 5.10.159 #1-NixOS
> >>>>>>> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
> >>>>>>> [1107585.943058] Call Trace:
> >>>>>>> [1107585.945680] dump_stack+0x6b/0x83
> >>>>>>> [1107585.949158] panic+0x101/0x2c8
> >>>>>>> [1107585.952379] ? printk+0x58/0x73
> >>>>>>> [1107585.955687] sysrq_handle_crash+0x16/0x20
> >>>>>>> [1107585.959859] __handle_sysrq.cold+0x43/0x11a
> >>>>>>> [1107585.964203] write_sysrq_trigger+0x24/0x40
> >>>>>>> [1107585.968463] proc_reg_write+0x51/0x90
> >>>>>>> [1107585.972290] vfs_write+0xc3/0x280
> >>>>>>> [1107585.975768] ksys_write+0x5f/0xe0
> >>>>>>> [1107585.979248] do_syscall_64+0x33/0x40
> >>>>>>> [1107585.982987] entry_SYSCALL_64_after_hwframe+0x61/0xc6
> >>>>>>> [1107585.988199] RIP: 0033:0x7f5873932133
> >>>>>>> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
> >>>>>>> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> >>>>>>> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
> >>>>>>> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
> >>>>>>> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
> >>>>>>> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
> >>>>>>> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002
> >>>>>>>
> >>>>>>> Nothing obvious to me here … if you have any further ideas what to
> >>>>>>> test, let me know. I should be more responsive again now.
> >>>>>>>
> >>>>>>> Thanks and kind regards,
> >>>>>>> Christian
> >>>>>>>
> >>>>>>>> On 5. Mar 2023, at 23:53, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>>>
> >>>>>>>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
> >>>>>>>>> I’m going to actually attach a serial console to watch the “echo c”
> >>>>>>>>> panic, maybe that gives _some_ indication.
> >>>>>>>>>
> >>>>>>>>> Otherwise: I can quickly run patches on the kernel there to try out
> >>>>>>>>> things. (And the funding offer still stands.)
> >>>>>>>>
> >>>>>>>> Any news on this? I'm curious what this could be.
> >>>>>>>>
> >>>>>>>> -corey
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Christian
> >>>>>>>>>
> >>>>>>>>>> On 1. Mar 2023, at 17:58, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
> >>>>>>>>>>> Thanks, both machines report:
> >>>>>>>>>>>
> >>>>>>>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
> >>>>>>>>>>> string
> >>>>>>>>>>
> >>>>>>>>>> At this point, I have no idea. I'd have to start adding printks into
> >>>>>>>>>> the code and cause crashes to see what is happening.
> >>>>>>>>>>
> >>>>>>>>>> Maybe something is getting in the way of the panic notifiers and doing
> >>>>>>>>>> something to prevent the IPMI driver from working.
> >>>>>>>>>>
> >>>>>>>>>> -corey
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Oh, I forgot. You can look at panic_op in
> >>>>>>>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
> >>>>>>>>>>>>
> >>>>>>>>>>>> -corey
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via Openipmi-developer wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via Openipmi-developer wrote:
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling
> >>>>>>>>>>>>>>> and am running out of ideas whether this is a bug or whether
> >>>>>>>>>>>>>>> something so subtle has changed in my config that I’m just
> >>>>>>>>>>>>>>> not seeing it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> (Note: I’m willing to pay for consulting.)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Probably not necessary.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks! The offer always stands. If we should ever meet I’m
> >>>>>>>>>>>>> also able to pay in beverages. ;)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo,
> >>>>>>>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS,
> >>>>>>>>>>>>>>> (mostly) vanilla kernel 5.10.159) and I’m now experiencing
> >>>>>>>>>>>>>>> crashes that seem to be kernel panics but do not get the
> >>>>>>>>>>>>>>> usual messages in the IPMI SEL.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I just tested on stock 5.10.159 and it worked without issue. Everything
> >>>>>>>>>>>>>> you have below looks ok.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Can you test by causing a crash with:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> echo c >/proc/sysrq-trigger
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> and see if it works?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> It sounds like you are having some type of crash that you would normally
> >>>>>>>>>>>>>> use the IPMI logs to debug. However, they aren't perfect; the system
> >>>>>>>>>>>>>> has to stay up long enough to get them into the event log.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think they are staying up long enough because a panic
> >>>>>>>>>>>>> triggers the 255 second bump in the watchdog and only then pass
> >>>>>>>>>>>>> on. However, I’ve also noticed that the kernel _should_ be
> >>>>>>>>>>>>> rebooting after a panic much faster (and not rely on the
> >>>>>>>>>>>>> watchdog) and that doesn’t happen either. (Sorry this just
> >>>>>>>>>>>>> popped from the back of my head).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> In this situation, getting a serial console (probably through IPMI
> >>>>>>>>>>>>>> Serial over LAN) and getting the console output on a crash is probably
> >>>>>>>>>>>>>> your best option. You can use ipmitool for this, or I have a library
> >>>>>>>>>>>>>> that is able to make connections to serial ports, including through IPMI
> >>>>>>>>>>>>>> SoL.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yup. Been there, too. :)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Unfortunately we’re currently chasing something that pops up
> >>>>>>>>>>>>> very randomly on somewhat odd machines and I also have the
> >>>>>>>>>>>>> feeling that it’s systematically broken right now (as the “echo
> >>>>>>>>>>>>> c” doesn’t work).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks a lot,
> >>>>>>>>>>>>> Christian
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> >>>>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>>>>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>> Openipmi-developer mailing list
> >>>>>>>>>>>>> Openipmi-developer@lists.sourceforge.net
> >>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer