Awesome!

> On 14. Mar 2023, at 15:54, Corey Minyard <miny...@acm.org> wrote:
>
> On Tue, Mar 14, 2023 at 03:22:39PM +0100, Christian Theune via
> Openipmi-developer wrote:
>> Hi,
>>
>> sorry, I didn’t expect you to make me a branch. I had already taken your
>> diff over to 5.10 as it applied cleanly … sorry for the additional work and
>> thanks anyways.
>
> Ok, that's great. It's something about the IPMI watchdog panic
> routines, and I can reproduce. I should be able to fix this pretty
> quickly. I'll send a patch when I get this fixed.
>
> Thanks,
>
> -corey
>
>>
>> Here’s the output:
>>
>> [ 6521.905890] sysrq: Trigger a crash
>> [ 6521.909294] Kernel panic - not syncing: sysrq triggered crash
>> [ 6521.915026] CPU: 1 PID: 43785 Comm: bash Tainted: G I 5.10.159 #1-NixOS
>> [ 6521.922925] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
>> [ 6521.930475] Call Trace:
>> [ 6521.932923] dump_stack+0x6b/0x83
>> [ 6521.936230] panic+0x101/0x2c8
>> [ 6521.939276] ? printk+0x58/0x73
>> [ 6521.942408] sysrq_handle_crash+0x16/0x20
>> [ 6521.946407] __handle_sysrq.cold+0x43/0x11a
>> [ 6521.950580] write_sysrq_trigger+0x24/0x40
>> [ 6521.954668] proc_reg_write+0x51/0x90
>> [ 6521.958322] vfs_write+0xc3/0x280
>> [ 6521.961627] ksys_write+0x5f/0xe0
>> [ 6521.964935] do_syscall_64+0x33/0x40
>> [ 6521.968502] entry_SYSCALL_64_after_hwframe+0x61/0xc6
>> [ 6521.973540] RIP: 0033:0x7f2c6b91a133
>> [ 6521.977106] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
>> [ 6521.995836] RSP: 002b:00007ffc4cf11088 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [ 6522.003387] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f2c6b91a133
>> [ 6522.010505] RDX: 0000000000000002 RSI: 0000000001555c08 RDI: 0000000000000001
>> [ 6522.017623] RBP: 0000000001555c08 R08: 000000000000000a R09: 00007f2c6b9aaf40
>> [ 6522.024743] R10: 00000000016e4218 R11: 0000000000000246 R12: 0000000000000002
>> [ 6522.031864] R13: 00007f2c6b9e8520 R14: 00007f2c6b9e8720 R15: 0000000000000002
>> [ 6522.039085] Calling notifier panic_event+0x0/0x410 [ipmi_msghandler] (000000008eb8cb44)
>> [ 6522.047071] IPMI message handler: IPMI: panic event handler
>> [ 6522.052628] IPMI message handler: IPMI: handling panic event for intf 0: 00000000443777b3 0000000067d05ff8
>> …
>> and then it reboots after the 255 seconds from the watchdog timer have passed.
>>
>> Christian
>>
>>> On 13. Mar 2023, at 18:13, Corey Minyard <miny...@acm.org> wrote:
>>>
>>> On Mon, Mar 13, 2023 at 05:42:39PM +0100, Christian Theune wrote:
>>>> Hrghs. I’m applying your patch to 5.10 as my distro build infrastructure
>>>> has some patches that don’t apply to 6.2 and that I don’t know how to
>>>> circumvent quickly enough… :)
>>>
>>> Ok, there's a
>>>
>>> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events-5.10
>>>
>>> branch available for you to pull. It's on top of latest 5.10.
>>>
>>> -corey
>>>
>>>>
>>>>> On 13. Mar 2023, at 16:59, Christian Theune <c...@flyingcircus.io> wrote:
>>>>>
>>>>> I should be easily able to run 6.2, no worries.
>>>>>
>>>>>
>>>>>> On 13. Mar 2023, at 16:33, Corey Minyard <miny...@acm.org> wrote:
>>>>>>
>>>>>> On Mon, Mar 13, 2023 at 02:07:01PM +0100, Christian Theune wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> yeah, the IPMI log is fine. This is a 10 minute interval job in our
>>>>>>> system that exports the log and clears it:
>>>>>>>
>>>>>>> The job looks like this:
>>>>>>>
>>>>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel elist
>>>>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel clear
>>>>>>>
>>>>>>> So it’s not atomic but it runs after the boot and the elist should
>>>>>>> output it properly … at least it did in the past. ;)
>>>>>>>
>>>>>>> As I said - I’m happy to run any patches you have. If you point me to a
>>>>>>> git branch somewhere I can switch that system easily.
>>>>>>
>>>>>> Ok, I have a branch at
>>>>>>
>>>>>> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events
>>>>>>
>>>>>> that has debug tracing. It will print the function for all panic event
>>>>>> handlers, their return values, and add traces in the IPMI panic event
>>>>>> handlers.
>>>>>>
>>>>>> It's a single patch right on top of 6.2; I'm not sure how portable it is
>>>>>> to other kernel versions. I can port if you like.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -corey
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Christian
>>>>>>>
>>>>>>>>> On 13. Mar 2023, at 13:58, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>
>>>>>>>> On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> alright, so here’s the output from the NixOS machine:
>>>>>>>>>
>>>>>>>>> root@xxx ~ # echo c >/proc/sysrq-trigger
>>>>>>>>> client_loop: send disconnect: Broken pipe
>>>>>>>>> …
>>>>>>>>>
>>>>>>>>> root@xxx ~ # journalctl -u ipmi-log.service
>>>>>>>>> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
>>>>>>>>> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL. Please allow a few seconds to erase.
>>>>>>>>> ...
>>>>>>>>> -- Boot fdef496e784e4541abd9ae40df472a0b --
>>>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
>>>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]: 2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
>>>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL. Please allow a few seconds to erase.
>>>>>>>>
>>>>>>>> Hmm, the SEL got cleared. That would clear out any of the logs that
>>>>>>>> were issued before that time. I'm not sure when the above happened
>>>>>>>> versus the crash, though. It looks like it occurred as part of the
>>>>>>>> reboot, but I'm not sure what I'm seeing. Maybe you have a startup
>>>>>>>> process that clears the SEL?
>>>>>>>>
>>>>>>>> Assuming that's not the issue, what you have looks ok. I'd need to add
>>>>>>>> some logs to the kernel to see if the log operation ever happens.
>>>>>>>>
>>>>>>>> -corey
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The SOL log looks like this:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1107585.917689] sysrq: Trigger a crash
>>>>>>>>> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
>>>>>>>>> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G I 5.10.159 #1-NixOS
>>>>>>>>> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
>>>>>>>>> [1107585.943058] Call Trace:
>>>>>>>>> [1107585.945680] dump_stack+0x6b/0x83
>>>>>>>>> [1107585.949158] panic+0x101/0x2c8
>>>>>>>>> [1107585.952379] ? printk+0x58/0x73
>>>>>>>>> [1107585.955687] sysrq_handle_crash+0x16/0x20
>>>>>>>>> [1107585.959859] __handle_sysrq.cold+0x43/0x11a
>>>>>>>>> [1107585.964203] write_sysrq_trigger+0x24/0x40
>>>>>>>>> [1107585.968463] proc_reg_write+0x51/0x90
>>>>>>>>> [1107585.972290] vfs_write+0xc3/0x280
>>>>>>>>> [1107585.975768] ksys_write+0x5f/0xe0
>>>>>>>>> [1107585.979248] do_syscall_64+0x33/0x40
>>>>>>>>> [1107585.982987] entry_SYSCALL_64_after_hwframe+0x61/0xc6
>>>>>>>>> [1107585.988199] RIP: 0033:0x7f5873932133
>>>>>>>>> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
>>>>>>>>> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>>>>>>>> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
>>>>>>>>> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
>>>>>>>>> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
>>>>>>>>> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
>>>>>>>>> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002
>>>>>>>>>
>>>>>>>>> Nothing obvious to me here … if you have any further ideas what to
>>>>>>>>> test, let me know. I should be more responsive again now.
>>>>>>>>>
>>>>>>>>> Thanks and kind regards,
>>>>>>>>> Christian
>>>>>>>>>
>>>>>>>>>> On 5. Mar 2023, at 23:53, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
>>>>>>>>>>> I’m going to actually attach a serial console to watch the “echo c”
>>>>>>>>>>> panic, maybe that gives _some_ indication.
>>>>>>>>>>>
>>>>>>>>>>> Otherwise: I can quickly run patches on the kernel there to try out
>>>>>>>>>>> things. (And the funding offer still stands.)
>>>>>>>>>>
>>>>>>>>>> Any news on this? I'm curious what this could be.
>>>>>>>>>>
>>>>>>>>>> -corey
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Christian
>>>>>>>>>>>
>>>>>>>>>>>> On 1. Mar 2023, at 17:58, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
>>>>>>>>>>>>> Thanks, both machines report:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>>>>>>> string
>>>>>>>>>>>>
>>>>>>>>>>>> At this point, I have no idea. I'd have to start adding printks into
>>>>>>>>>>>> the code and cause crashes to see what is happening.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe something is getting in the way of the panic notifiers and doing
>>>>>>>>>>>> something to prevent the IPMI driver from working.
>>>>>>>>>>>>
>>>>>>>>>>>> -corey
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Oh, I forgot. You can look at panic_op in
>>>>>>>>>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -corey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via
>>>>>>>>>>>>>> Openipmi-developer wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via
>>>>>>>>>>>>>>>> Openipmi-developer wrote:
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling
>>>>>>>>>>>>>>>>> and am running out of ideas whether this is a bug or whether
>>>>>>>>>>>>>>>>> something so subtle has changed in my config that I’m just
>>>>>>>>>>>>>>>>> not seeing it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (Note: I’m willing to pay for consulting.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Probably not necessary.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks! The offer always stands. If we should ever meet I’m
>>>>>>>>>>>>>>> also able to pay in beverages. ;)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo,
>>>>>>>>>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS,
>>>>>>>>>>>>>>>>> (mostly) vanilla kernel 5.10.159) and I’m now experiencing
>>>>>>>>>>>>>>>>> crashes that seem to be kernel panics but do not get the
>>>>>>>>>>>>>>>>> usual messages in the IPMI SEL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I just tested on stock 5.10.159 and it worked without issue.
>>>>>>>>>>>>>>>> Everything you have below looks ok.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you test by causing a crash with:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> echo c >/proc/sysrq-trigger
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> and see if it works?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It sounds like you are having some type of crash that you
>>>>>>>>>>>>>>>> would normally use the IPMI logs to debug. However, they
>>>>>>>>>>>>>>>> aren't perfect; the system has to stay up long enough to get
>>>>>>>>>>>>>>>> them into the event log.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think they are staying up long enough because a panic
>>>>>>>>>>>>>>> triggers the 255 second bump in the watchdog and only then pass
>>>>>>>>>>>>>>> on. However, I’ve also noticed that the kernel _should_ be
>>>>>>>>>>>>>>> rebooting after a panic much faster (and not rely on the
>>>>>>>>>>>>>>> watchdog) and that doesn’t happen either. (Sorry this just
>>>>>>>>>>>>>>> popped from the back of my head).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In this situation, getting a serial console (probably through
>>>>>>>>>>>>>>>> IPMI Serial over LAN) and getting the console output on a
>>>>>>>>>>>>>>>> crash is probably your best option. You can use ipmitool for
>>>>>>>>>>>>>>>> this, or I have a library that is able to make connections to
>>>>>>>>>>>>>>>> serial ports, including through IPMI SoL.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yup. Been there, too. :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Unfortunately we’re currently chasing something that pops up
>>>>>>>>>>>>>>> very randomly on somewhat odd machines and I also have the
>>>>>>>>>>>>>>> feeling that it’s systematically broken right now (as the “echo
>>>>>>>>>>>>>>> c” doesn’t work).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>>>> Christian
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>>>>>>>>> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> Openipmi-developer mailing list
>>>>>>>>>>>>>>> Openipmi-developer@lists.sourceforge.net
>>>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>> Christian Theune
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>>>>>>> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> Christian Theune
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>>>>> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kind regards,
>>>>>>>>> Christian Theune
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>>> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
>>>>>>>
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Christian Theune
>>>>>>>
>>>>>>> --
>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
>>>>>>>
>>>>
>>>> Kind regards,
>>>> Christian Theune
>>>>
>>>> --
>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
>>>>
>>
>> Kind regards,
>> Christian Theune
>>
>> --
>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
>>
>>
>>
>> _______________________________________________
>> Openipmi-developer mailing list
>> Openipmi-developer@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
Kind regards,
Christian Theune

--
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick

_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer