Hi,

Sorry, I didn’t expect you to make me a branch. I had already taken your diff 
over to 5.10 as it applied cleanly … sorry for the additional work, and thanks 
anyway.

Here’s the output:

[ 6521.905890] sysrq: Trigger a crash
[ 6521.909294] Kernel panic - not syncing: sysrq triggered crash
[ 6521.915026] CPU: 1 PID: 43785 Comm: bash Tainted: G          I       5.10.159 #1-NixOS
[ 6521.922925] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
[ 6521.930475] Call Trace:
[ 6521.932923]  dump_stack+0x6b/0x83
[ 6521.936230]  panic+0x101/0x2c8
[ 6521.939276]  ? printk+0x58/0x73
[ 6521.942408]  sysrq_handle_crash+0x16/0x20
[ 6521.946407]  __handle_sysrq.cold+0x43/0x11a
[ 6521.950580]  write_sysrq_trigger+0x24/0x40
[ 6521.954668]  proc_reg_write+0x51/0x90
[ 6521.958322]  vfs_write+0xc3/0x280
[ 6521.961627]  ksys_write+0x5f/0xe0
[ 6521.964935]  do_syscall_64+0x33/0x40
[ 6521.968502]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 6521.973540] RIP: 0033:0x7f2c6b91a133
[ 6521.977106] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
[ 6521.995836] RSP: 002b:00007ffc4cf11088 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 6522.003387] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f2c6b91a133
[ 6522.010505] RDX: 0000000000000002 RSI: 0000000001555c08 RDI: 0000000000000001
[ 6522.017623] RBP: 0000000001555c08 R08: 000000000000000a R09: 00007f2c6b9aaf40
[ 6522.024743] R10: 00000000016e4218 R11: 0000000000000246 R12: 0000000000000002
[ 6522.031864] R13: 00007f2c6b9e8520 R14: 00007f2c6b9e8720 R15: 0000000000000002
[ 6522.039085] Calling notifier panic_event+0x0/0x410 [ipmi_msghandler] (000000008eb8cb44)
[ 6522.047071] IPMI message handler: IPMI: panic event handler
[ 6522.052628] IPMI message handler: IPMI: handling panic event for intf 0: 00000000443777b3 0000000067d05ff8
…
and then it reboots once the watchdog timer’s 255 seconds have passed.
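
To see at a glance which panic notifiers were reached in a capture like this, the `Calling notifier` lines can be pulled out mechanically. This is only a minimal sketch: the pattern is derived from the single `Calling notifier` line shown here, other lines printed by the debug patch may be laid out differently, and the names `NOTIFIER_RE` and `called_notifiers` are made up for illustration.

```python
import re

# Pattern based only on the one "Calling notifier" line in the capture;
# this layout is an assumption, not a documented kernel log format.
NOTIFIER_RE = re.compile(
    r"Calling notifier (?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def called_notifiers(log_text):
    """Return (function, module) pairs, one per 'Calling notifier' line.

    module is None when the line has no [module] tag after the symbol.
    """
    return [
        (m.group("func"), m.group("module"))
        for m in NOTIFIER_RE.finditer(log_text)
    ]


sample = (
    "[ 6522.039085] Calling notifier panic_event+0x0/0x410 [ipmi_msghandler] "
    "(000000008eb8cb44)\n"
    "[ 6522.047071] IPMI message handler: IPMI: panic event handler\n"
)
print(called_notifiers(sample))
```

Feeding it the whole serial console capture gives the list of notifiers in the order they were called, which makes it easier to spot whether anything ran after the IPMI handler.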

Christian

> On 13. Mar 2023, at 18:13, Corey Minyard <miny...@acm.org> wrote:
> 
> On Mon, Mar 13, 2023 at 05:42:39PM +0100, Christian Theune wrote:
>> Hrghs. I’m applying your patch to 5.10 as my distro build infrastructure has 
>> some patches that don’t apply to 6.2 and that I don’t know how to circumvent 
>> quickly enough… :)
> 
> Ok, there's a
> 
> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events-5.10
> 
> branch available for you to pull.  It's on top of latest 5.10.
> 
> -corey
> 
>> 
>>> On 13. Mar 2023, at 16:59, Christian Theune <c...@flyingcircus.io> wrote:
>>> 
>>> I should be easily able to run 6.2, no worries.
>>> 
>>> 
>>>> On 13. Mar 2023, at 16:33, Corey Minyard <miny...@acm.org> wrote:
>>>> 
>>>> On Mon, Mar 13, 2023 at 02:07:01PM +0100, Christian Theune wrote:
>>>>> Hi,
>>>>> 
>>>>> yeah, the IPMI log is fine. A job in our system runs every 10 minutes, 
>>>>> exports the log and then clears it. The job looks like this:
>>>>> 
>>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel elist
>>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel clear
>>>>> 
>>>>> So it’s not atomic but it runs after the boot and the elist should output 
>>>>> it properly … at least it did in the past. ;)
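
One way to make the export-and-clear job a little safer is to clear the SEL only when the export succeeded. The following is a rough sketch, not the job that actually runs here: the function name `export_then_clear` and its `ipmitool` and `run` parameters are made up for illustration, and only the `sel elist` and `sel clear` subcommands come from the job above.

```python
import subprocess


def export_then_clear(ipmitool="ipmitool", run=subprocess.run):
    """Dump the SEL and clear it only if the dump succeeded.

    Returns the exported listing, or None if the export failed
    (in which case the SEL is left untouched).
    `run` is injectable so the logic can be exercised without a BMC.
    """
    listing = run([ipmitool, "sel", "elist"], capture_output=True, text=True)
    if listing.returncode != 0:
        return None
    # Still not atomic: an event logged between the two calls is lost.
    run([ipmitool, "sel", "clear"], check=True)
    return listing.stdout
```

This only guards against clearing after a failed export; entries written between the two commands can still disappear, so the race itself remains.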
>>>>> 
>>>>> As I said - I’m happy to run any patches you have. If you point me to a 
>>>>> git branch somewhere I can switch that system easily.
>>>> 
>>>> Ok, I have a branch at
>>>> 
>>>> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events
>>>> 
>>>> that has debug tracing.  It will print the function for all panic event
>>>> handlers, their return values, and adds traces in the IPMI panic event
>>>> handlers.
>>>> 
>>>> It's a single patch right on top of 6.2; I'm not sure how portable it is
>>>> to other kernel versions.  I can port if you like.
>>>> 
>>>> Thanks,
>>>> 
>>>> -corey
>>>> 
>>>>> 
>>>>> Cheers,
>>>>> Christian
>>>>> 
>>>>>>> On 13. Mar 2023, at 13:58, Corey Minyard <miny...@acm.org> wrote:
>>>>>> 
>>>>>> On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> alright, so here’s the output from the NixOS machine:
>>>>>>> 
>>>>>>> root@xxx ~ # echo c >/proc/sysrq-trigger
>>>>>>> client_loop: send disconnect: Broken pipe
>>>>>>> …
>>>>>>> 
>>>>>>> root@xxx ~ # journalctl -u ipmi-log.service
>>>>>>> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
>>>>>>> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL.  Please allow a few seconds to erase.
>>>>>>> ...
>>>>>>> -- Boot fdef496e784e4541abd9ae40df472a0b --
>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL.  Please allow a few seconds to erase.
>>>>>> 
>>>>>> Hmm, the SEL got cleared.  That would clear out any of the logs that
>>>>>> were issued before that time.  I'm not sure when the above happened
>>>>>> versus the crash, though.  It looks like it occurred as part of the
>>>>>> reboot, but I'm not sure what I'm seeing.  Maybe you have a startup
>>>>>> process that clears the SEL?
>>>>>> 
>>>>>> Assuming that's not the issue, what you have looks ok.  I'd need to add
>>>>>> some logs to the kernel to see if the log operation ever happens.
>>>>>> 
>>>>>> -corey
>>>>>> 
>>>>>>> 
>>>>>>> The SOL log looks like this:
>>>>>>> 
>>>>>>> 
>>>>>>> [1107585.917689] sysrq: Trigger a crash
>>>>>>> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
>>>>>>> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G          I       5.10.159 #1-NixOS
>>>>>>> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
>>>>>>> [1107585.943058] Call Trace:
>>>>>>> [1107585.945680]  dump_stack+0x6b/0x83
>>>>>>> [1107585.949158]  panic+0x101/0x2c8
>>>>>>> [1107585.952379]  ? printk+0x58/0x73
>>>>>>> [1107585.955687]  sysrq_handle_crash+0x16/0x20
>>>>>>> [1107585.959859]  __handle_sysrq.cold+0x43/0x11a
>>>>>>> [1107585.964203]  write_sysrq_trigger+0x24/0x40
>>>>>>> [1107585.968463]  proc_reg_write+0x51/0x90
>>>>>>> [1107585.972290]  vfs_write+0xc3/0x280
>>>>>>> [1107585.975768]  ksys_write+0x5f/0xe0
>>>>>>> [1107585.979248]  do_syscall_64+0x33/0x40
>>>>>>> [1107585.982987]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
>>>>>>> [1107585.988199] RIP: 0033:0x7f5873932133
>>>>>>> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
>>>>>>> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>>>>>> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
>>>>>>> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
>>>>>>> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
>>>>>>> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
>>>>>>> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002
>>>>>>> 
>>>>>>> Nothing obvious to me here … if you have any further ideas what to 
>>>>>>> test, let me know. I should be more responsive again now.
>>>>>>> 
>>>>>>> Thanks and kind regards,
>>>>>>> Christian
>>>>>>> 
>>>>>>>> On 5. Mar 2023, at 23:53, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>> 
>>>>>>>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
>>>>>>>>> I’m going to actually attach a serial console to watch the “echo c” 
>>>>>>>>> panic, maybe that gives _some_ indication.
>>>>>>>>> 
>>>>>>>>> Otherwise: I can quickly run patches on the kernel there to try out 
>>>>>>>>> things. (And the funding offer still stands.)
>>>>>>>> 
>>>>>>>> Any news on this?  I'm curious what this could be.
>>>>>>>> 
>>>>>>>> -corey
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Christian
>>>>>>>>> 
>>>>>>>>>> On 1. Mar 2023, at 17:58, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
>>>>>>>>>>> Thanks, both machines report:
>>>>>>>>>>> 
>>>>>>>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>>>>> string
>>>>>>>>>> 
>>>>>>>>>> At this point, I have no idea.  I'd have to start adding printks into
>>>>>>>>>> the code and cause crashes to see what is happening.
>>>>>>>>>> 
>>>>>>>>>> Maybe something is getting in the way of the panic notifiers and doing
>>>>>>>>>> something to prevent the IPMI driver from working.
>>>>>>>>>> 
>>>>>>>>>> -corey
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Oh, I forgot.  You can look at panic_op in 
>>>>>>>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>>>>>> 
>>>>>>>>>>>> -corey
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via 
>>>>>>>>>>>> Openipmi-developer wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via 
>>>>>>>>>>>>>> Openipmi-developer wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and 
>>>>>>>>>>>>>>> am running out of ideas whether this is a bug or whether 
>>>>>>>>>>>>>>> something so subtle has changed in my config that I’m just not 
>>>>>>>>>>>>>>> seeing it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> (Note: I’m willing to pay for consulting.)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Probably not necessary.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks! The offer always stands. If we should ever meet I’m also 
>>>>>>>>>>>>> able to pay in beverages. ;)
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo, 
>>>>>>>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS, 
>>>>>>>>>>>>>>> (mostly) vanilla kernel 5.10.159) and I’m now experiencing 
>>>>>>>>>>>>>>> crashes that seem to be kernel panics but do not get the usual 
>>>>>>>>>>>>>>> messages in the IPMI SEL.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I just tested on stock 5.10.159 and it worked without issue.
>>>>>>>>>>>>>> Everything you have below looks ok.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Can you test by causing a crash with:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> echo c >/proc/sysrq-trigger
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> and see if it works?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It sounds like you are having some type of crash that you would
>>>>>>>>>>>>>> normally use the IPMI logs to debug.  However, they aren't perfect;
>>>>>>>>>>>>>> the system has to stay up long enough to get them into the event log.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think they are staying up long enough, because a panic triggers 
>>>>>>>>>>>>> the 255 second bump in the watchdog and the machines only go down 
>>>>>>>>>>>>> after that. However, I’ve also noticed that the kernel _should_ be 
>>>>>>>>>>>>> rebooting after a panic much faster (and not rely on the watchdog) 
>>>>>>>>>>>>> and that doesn’t happen either. (Sorry, this just popped up from the 
>>>>>>>>>>>>> back of my head.)
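
Whether the kernel reboots by itself after a panic is controlled by the `kernel.panic` sysctl (`/proc/sys/kernel/panic`): 0 means it waits forever, a positive value is a delay in seconds, and a negative value means an immediate reboot. A minimal sketch for reading it together with `panic_op` follows; the helper names and the `root` parameter (only there so the code can be pointed at a test directory) are made up for illustration.

```python
from pathlib import Path


def read_panic_settings(root="/"):
    """Read kernel.panic and the ipmi_msghandler panic_op parameter.

    A missing file yields None for that entry (for example when the
    ipmi_msghandler module is not loaded).
    """
    root = Path(root)

    def read(rel):
        path = root / rel
        return path.read_text().strip() if path.exists() else None

    timeout = read("proc/sys/kernel/panic")
    return {
        "panic_timeout": int(timeout) if timeout is not None else None,
        "panic_op": read("sys/module/ipmi_msghandler/parameters/panic_op"),
    }


def describe_timeout(seconds):
    """Translate a kernel.panic value into plain words."""
    if seconds is None:
        return "unknown"
    if seconds == 0:
        return "no automatic reboot after a panic"
    if seconds < 0:
        return "reboot immediately after a panic"
    return f"reboot {seconds}s after a panic"
```

If `panic_timeout` comes back as 0, the kernel never reboots on its own and the watchdog reset is the only thing that brings the machine back.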
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In this situation, getting a serial console (probably through IPMI
>>>>>>>>>>>>>> Serial over LAN) and getting the console output on a crash is
>>>>>>>>>>>>>> probably your best option.  You can use ipmitool for this, or I have
>>>>>>>>>>>>>> a library that is able to make connections to serial ports, including
>>>>>>>>>>>>>> through IPMI SoL.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yup. Been there, too. :)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Unfortunately we’re currently chasing something that pops up very 
>>>>>>>>>>>>> randomly on somewhat odd machines and I also have the feeling 
>>>>>>>>>>>>> that it’s systematically broken right now (as the “echo c” 
>>>>>>>>>>>>> doesn’t work).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>> Christian
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, 
>>>>>>>>>>>>> Christian Zagrodnick
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Openipmi-developer mailing list
>>>>>>>>>>>>> Openipmi-developer@lists.sourceforge.net
>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>> 

Kind regards,
Christian Theune

-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
