On Tue, Mar 14, 2023 at 03:22:39PM +0100, Christian Theune via 
Openipmi-developer wrote:
> Hi,
> 
> sorry, I didn’t expect you to make me a branch. I had already taken your diff 
> over to 5.10 as it applied cleanly … sorry for the additional work and thanks 
> anyways.

Ok, that's great.  It's something about the IPMI watchdog panic
routines, and I can reproduce.  I should be able to fix this pretty
quickly.  I'll send a patch when I get this fixed.

Thanks,

-corey

> 
> Here’s the output:
> 
> [ 6521.905890] sysrq: Trigger a crash
> [ 6521.909294] Kernel panic - not syncing: sysrq triggered crash
> [ 6521.915026] CPU: 1 PID: 43785 Comm: bash Tainted: G          I       
> 5.10.159 #1-NixOS
> [ 6521.922925] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 
> 07/23/2012
> [ 6521.930475] Call Trace:
> [ 6521.932923]  dump_stack+0x6b/0x83
> [ 6521.936230]  panic+0x101/0x2c8
> [ 6521.939276]  ? printk+0x58/0x73
> [ 6521.942408]  sysrq_handle_crash+0x16/0x20
> [ 6521.946407]  __handle_sysrq.cold+0x43/0x11a
> [ 6521.950580]  write_sysrq_trigger+0x24/0x40
> [ 6521.954668]  proc_reg_write+0x51/0x90
> [ 6521.958322]  vfs_write+0xc3/0x280
> [ 6521.961627]  ksys_write+0x5f/0xe0
> [ 6521.964935]  do_syscall_64+0x33/0x40
> [ 6521.968502]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
> [ 6521.973540] RIP: 0033:0x7f2c6b91a133
> [ 6521.977106] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 
> 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 
> 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
> [ 6521.995836] RSP: 002b:00007ffc4cf11088 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000001
> [ 6522.003387] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 
> 00007f2c6b91a133
> [ 6522.010505] RDX: 0000000000000002 RSI: 0000000001555c08 RDI: 
> 0000000000000001
> [ 6522.017623] RBP: 0000000001555c08 R08: 000000000000000a R09: 
> 00007f2c6b9aaf40
> [ 6522.024743] R10: 00000000016e4218 R11: 0000000000000246 R12: 
> 0000000000000002
> [ 6522.031864] R13: 00007f2c6b9e8520 R14: 00007f2c6b9e8720 R15: 
> 0000000000000002
> [ 6522.039085] Calling notifier panic_event+0x0/0x410 [ipmi_msghandler] 
> (000000008eb8cb44)
> [ 6522.047071] IPMI message handler: IPMI: panic event handler
> [ 6522.052628] IPMI message handler: IPMI: handling panic event for intf 0: 
> 00000000443777b3 0000000067d05ff8
> …
> and then it reboots after the 255 seconds from the watchdog timer are passed.
> 
> Christian
> 
> > On 13. Mar 2023, at 18:13, Corey Minyard <miny...@acm.org> wrote:
> > 
> > On Mon, Mar 13, 2023 at 05:42:39PM +0100, Christian Theune wrote:
> >> Hrghs. I’m applying your patch to 5.10 as my distro build infrastructure 
> >> has some patches that don’t apply to 6.2 and that I don’t know how to 
> >> circumvent quickly enough… :)
> > 
> > Ok, there's a
> > 
> > https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events-5.10
> > 
> > branch available for you to pull.  It's on top of latest 5.10.
> > 
> > -corey
> > 
> >> 
> >>> On 13. Mar 2023, at 16:59, Christian Theune <c...@flyingcircus.io> wrote:
> >>> 
> >>> I should be easily able to run 6.2, no worries.
> >>> 
> >>> 
> >>>> On 13. Mar 2023, at 16:33, Corey Minyard <miny...@acm.org> wrote:
> >>>> 
> >>>> On Mon, Mar 13, 2023 at 02:07:01PM +0100, Christian Theune wrote:
> >>>>> Hi,
> >>>>> 
> >>>>> yeah, the IPMI log is fine. This is a 10 minute interval job in our 
> >>>>> system that exports the log and clears it:
> >>>>> 
> >>>>> The job looks like this:
> >>>>> 
> >>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool
> >>>>>  sel elist
> >>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool
> >>>>>  sel clear
> >>>>> 
> >>>>> So it’s not atomic but it runs after the boot and the elist should 
> >>>>> output it properly … at least it did in the past. ;)
> >>>>> 
> >>>>> As I said - I’m happy to run any patches you have. If you point me to a 
> >>>>> git branch somewhere I can switch that system easily.
> >>>> 
> >>>> Ok, I have a branch at
> >>>> 
> >>>> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events
> >>>> 
> >>>> that has debug tracing.  It will print the function for all panic event
> >>>> handlers, their return values, and adds traces in the IPMI panic event
> >>>> handlers.
> >>>> 
> >>>> It's a single patch right on top of 6.2; I'm not sure how portable it is
> >>>> to other kernel versions.  I can port if you like.
> >>>> 
> >>>> Thanks,
> >>>> 
> >>>> -corey
> >>>> 
> >>>>> 
> >>>>> Cheers,
> >>>>> Christian
> >>>>> 
> >>>>>>> On 13. Mar 2023, at 13:58, Corey Minyard <miny...@acm.org> wrote:
> >>>>>> 
> >>>>>> On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
> >>>>>>> Hi,
> >>>>>>> 
> >>>>>>> alright, so here’s the output from the NixOS machine:
> >>>>>>> 
> >>>>>>> root@xxx ~ # echo c >/proc/sysrq-trigger
> >>>>>>> client_loop: send disconnect: Broken pipe
> >>>>>>> …
> >>>>>>> 
> >>>>>>> root@xxx ~ # journalctl -u ipmi-log.service
> >>>>>>> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 
> >>>>>>> 2023-03-13 10:25:27 CET. --
> >>>>>>> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL.  Please 
> >>>>>>> allow a few seconds to erase.
> >>>>>>> ...
> >>>>>>> -- Boot fdef496e784e4541abd9ae40df472a0b --
> >>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    1 | 03/13/2023 | 
> >>>>>>> 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | 
> >>>>>>> Asserted
> >>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    2 | 03/13/2023 | 
> >>>>>>> 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
> >>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL.  Please allow 
> >>>>>>> a few seconds to erase.
> >>>>>> 
> >>>>>> Hmm, the SEL got cleared.  That would clear out any of the logs that
> >>>>>> were issued before that time.  I'm not sure when the above happened
> >>>>>> verses the crash, though.  It looks like it occurred as part of the
> >>>>>> reboot, but I'm not sure what I'm seeing.  Maybe you have a startup
> >>>>>> process that clears the SEL?
> >>>>>> 
> >>>>>> Assuming that's not the issue, what you have looks ok.  I'd need to add
> >>>>>> some logs to the kernel to see if the log operation ever happens.
> >>>>>> 
> >>>>>> -corey
> >>>>>> 
> >>>>>>> 
> >>>>>>> The SOL log looks like this:
> >>>>>>> 
> >>>>>>> 
> >>>>>>> [1107585.917689] sysrq: Trigger a crash
> >>>>>>> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
> >>>>>>> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G          I  
> >>>>>>>      5.10.159 #1-NixOS
> >>>>>>> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 
> >>>>>>> 1.11.0 07/23/2012
> >>>>>>> [1107585.943058] Call Trace:
> >>>>>>> [1107585.945680]  dump_stack+0x6b/0x83
> >>>>>>> [1107585.949158]  panic+0x101/0x2c8
> >>>>>>> [1107585.952379]  ? printk+0x58/0x73
> >>>>>>> [1107585.955687]  sysrq_handle_crash+0x16/0x20
> >>>>>>> [1107585.959859]  __handle_sysrq.cold+0x43/0x11a
> >>>>>>> [1107585.964203]  write_sysrq_trigger+0x24/0x40
> >>>>>>> [1107585.968463]  proc_reg_write+0x51/0x90
> >>>>>>> [1107585.972290]  vfs_write+0xc3/0x280
> >>>>>>> [1107585.975768]  ksys_write+0x5f/0xe0
> >>>>>>> [1107585.979248]  do_syscall_64+0x33/0x40
> >>>>>>> [1107585.982987]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
> >>>>>>> [1107585.988199] RIP: 0033:0x7f5873932133
> >>>>>>> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb 
> >>>>>>> b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 
> >>>>>>> 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 
> >>>>>>> 55 48 89 f5
> >>>>>>> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 
> >>>>>>> ORIG_RAX: 0000000000000001
> >>>>>>> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 
> >>>>>>> 00007f5873932133
> >>>>>>> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 
> >>>>>>> 0000000000000001
> >>>>>>> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 
> >>>>>>> 00007f58739c2f40
> >>>>>>> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 
> >>>>>>> 0000000000000002
> >>>>>>> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 
> >>>>>>> 0000000000000002
> >>>>>>> 
> >>>>>>> Nothing obvious to me here … if you have any further ideas what to 
> >>>>>>> test, let me know. I should be more responsive again now.
> >>>>>>> 
> >>>>>>> Thanks and kind regards,
> >>>>>>> Christian
> >>>>>>> 
> >>>>>>>> On 5. Mar 2023, at 23:53, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>>> 
> >>>>>>>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
> >>>>>>>>> I’m going to actually attach a serial console to watch the “echo c” 
> >>>>>>>>> panic, maybe that gives _some_ indication.
> >>>>>>>>> 
> >>>>>>>>> Otherwise: I can quickly run patches on the kernel there to try out 
> >>>>>>>>> things. (And the funding offer still stands.)
> >>>>>>>> 
> >>>>>>>> Any news on this?  I'm curious what this could be.
> >>>>>>>> 
> >>>>>>>> -corey
> >>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Christian
> >>>>>>>>> 
> >>>>>>>>>> On 1. Mar 2023, at 17:58, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>>>>> 
> >>>>>>>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
> >>>>>>>>>>> Thanks, both machines report:
> >>>>>>>>>>> 
> >>>>>>>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
> >>>>>>>>>>> string
> >>>>>>>>>> 
> >>>>>>>>>> At this point, I have no idea.  I'd have to start adding printks 
> >>>>>>>>>> into
> >>>>>>>>>> the code and cause crashes to see what is happing.
> >>>>>>>>>> 
> >>>>>>>>>> Maybe something is getting in the way of the panic notifiers and 
> >>>>>>>>>> doing
> >>>>>>>>>> something to prevent the IPMI driver from working.
> >>>>>>>>>> 
> >>>>>>>>>> -corey
> >>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <miny...@acm.org> wrote:
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Oh, I forgot.  You can look at panic_op in 
> >>>>>>>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -corey
> >>>>>>>>>>>> 
> >>>>>>>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via 
> >>>>>>>>>>>> Openipmi-developer wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <miny...@acm.org> 
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via 
> >>>>>>>>>>>>>> Openipmi-developer wrote:
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling 
> >>>>>>>>>>>>>>> and am running out of ideas whether this is a bug or whether 
> >>>>>>>>>>>>>>> something so subtle has changed in my config that I’m just 
> >>>>>>>>>>>>>>> not seeing it.
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> (Note: I’m willing to pay for consulting.)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Probably not necessary.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Thanks! The offer always stands. If we should ever meet I’m 
> >>>>>>>>>>>>> also able to pay in beverages. ;)
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo, 
> >>>>>>>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS, 
> >>>>>>>>>>>>>>> (mostly) vanilla kernel 5.10.159) and I’m now experiencing 
> >>>>>>>>>>>>>>> crashes that seem to be kernel panics but do not get the 
> >>>>>>>>>>>>>>> usual messages in the IPMI SEL.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I just tested on stock 5.10.159 and it worked without issue.  
> >>>>>>>>>>>>>> Everything
> >>>>>>>>>>>>>> you have below looks ok.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Can you test by causing a crash with:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> echo c >/proc/sysrq-trigger
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> and see if it works?
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> It sounds like you are having some type of crash that you 
> >>>>>>>>>>>>>> would normally
> >>>>>>>>>>>>>> use the IPMI logs to debug.  However, they aren't perfect, the 
> >>>>>>>>>>>>>> system
> >>>>>>>>>>>>>> has to stay up long enough to get them into the event log.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> I think they are staying up long enough because a panic 
> >>>>>>>>>>>>> triggers the 255 second bump in the watchdog and only then pass 
> >>>>>>>>>>>>> on. However, i’ve also noticed that the kernel _should_ be 
> >>>>>>>>>>>>> rebooting after a panic much faster (and not rely on the 
> >>>>>>>>>>>>> watchdog) and that doesn’t happen either. (Sorry this just 
> >>>>>>>>>>>>> popped from the back of my head).
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> In this situation, getting a serial console (probably through 
> >>>>>>>>>>>>>> IPMI
> >>>>>>>>>>>>>> Serial over LAN) and getting the console output on a crash is 
> >>>>>>>>>>>>>> probably
> >>>>>>>>>>>>>> your best option.  You can use ipmitool for this, or I have a 
> >>>>>>>>>>>>>> library
> >>>>>>>>>>>>>> that is able to make connections to serial ports, including 
> >>>>>>>>>>>>>> through IPMI
> >>>>>>>>>>>>>> SoL.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Yup. Been there, too. :)
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Unfortunately we’re currently chasing something that pops up 
> >>>>>>>>>>>>> very randomly on somewhat odd machines and I also have the 
> >>>>>>>>>>>>> feeling that it’s systematically broken right now (as the “echo 
> >>>>>>>>>>>>> c” doesn’t work).
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Thanks a lot,
> >>>>>>>>>>>>> Christian
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> -- 
> >>>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> >>>>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>>>>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, 
> >>>>>>>>>>>>> Christian Zagrodnick
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>> Openipmi-developer mailing list
> >>>>>>>>>>>>> Openipmi-developer@lists.sourceforge.net
> >>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
> >>>>>>>>>>> 
> >>>>>>>>>>> Liebe Grüße,
> >>>>>>>>>>> Christian Theune
> >>>>>>>>>>> 
> >>>>>>>>>>> -- 
> >>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> >>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, 
> >>>>>>>>>>> Christian Zagrodnick
> >>>>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Liebe Grüße,
> >>>>>>>>> Christian Theune
> >>>>>>>>> 
> >>>>>>>>> -- 
> >>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> >>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> >>>>>>>>> Zagrodnick
> >>>>>>>>> 
> >>>>>>> 
> >>>>>>> Liebe Grüße,
> >>>>>>> Christian Theune
> >>>>>>> 
> >>>>>>> -- 
> >>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> >>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> >>>>>>> Zagrodnick
> >>>>> 
> >>>>> 
> >>>>> Liebe Grüße,
> >>>>> Christian Theune
> >>>>> 
> >>>>> -- 
> >>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> >>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> >>>>> Zagrodnick
> >>>>> 
> >> 
> >> Liebe Grüße,
> >> Christian Theune
> >> 
> >> -- 
> >> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> >> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> >> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> >> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> >> Zagrodnick
> >> 
> 
> Liebe Grüße,
> Christian Theune
> 
> -- 
> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
> 
> 
> 
> _______________________________________________
> Openipmi-developer mailing list
> Openipmi-developer@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openipmi-developer


_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to