Hi,

alright, so here’s the output from the NixOS machine:

root@xxx ~ # echo c >/proc/sysrq-trigger
client_loop: send disconnect: Broken pipe
…

root@xxx ~ # journalctl -u ipmi-log.service
-- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 
10:25:27 CET. --
Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL.  Please allow a few 
seconds to erase.
...
-- Boot fdef496e784e4541abd9ae40df472a0b --
Mar 13 10:25:07 xxx ipmi-log-start[1973]:    1 | 03/13/2023 | 09:12:49 | Event 
Logging Disabled SEL | Log area reset/cleared | Asserted
Mar 13 10:25:07 xxx ipmi-log-start[1973]:    2 | 03/13/2023 | 09:21:06 | 
Watchdog2 OS Watchdog | Hard reset | Asserted
Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL.  Please allow a few 
seconds to erase.

The SOL log looks like this:


[1107585.917689] sysrq: Trigger a crash
[1107585.921272] Kernel panic - not syncing: sysrq triggered crash
[1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G          I       
5.10.159 #1-NixOS
[1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 
07/23/2012
[1107585.943058] Call Trace:
[1107585.945680]  dump_stack+0x6b/0x83
[1107585.949158]  panic+0x101/0x2c8
[1107585.952379]  ? printk+0x58/0x73
[1107585.955687]  sysrq_handle_crash+0x16/0x20
[1107585.959859]  __handle_sysrq.cold+0x43/0x11a
[1107585.964203]  write_sysrq_trigger+0x24/0x40
[1107585.968463]  proc_reg_write+0x51/0x90
[1107585.972290]  vfs_write+0xc3/0x280
[1107585.975768]  ksys_write+0x5f/0xe0
[1107585.979248]  do_syscall_64+0x33/0x40
[1107585.982987]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[1107585.988199] RIP: 0033:0x7f5873932133
[1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 
00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
[1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000001
[1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 
00007f5873932133
[1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 
0000000000000001
[1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 
00007f58739c2f40
[1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 
0000000000000002
[1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 
0000000000000002

Nothing obvious to me here … if you have any further ideas what to test, let me 
know. I should be more responsive again now.

Thanks and kind regards,
Christian

> On 5. Mar 2023, at 23:53, Corey Minyard <[email protected]> wrote:
> 
> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
>> I’m going to actually attach a serial console to watch the “echo c” panic, 
>> maybe that gives _some_ indication.
>> 
>> Otherwise: I can quickly run patches on the kernel there to try out things. 
>> (And the funding offer still stands.)
> 
> Any news on this?  I'm curious what this could be.
> 
> -corey
> 
>> 
>> Christian
>> 
>>> On 1. Mar 2023, at 17:58, Corey Minyard <[email protected]> wrote:
>>> 
>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
>>>> Thanks, both machines report:
>>>> 
>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
>>>> string
>>> 
>>> At this point, I have no idea.  I'd have to start adding printks into
>>> the code and cause crashes to see what is happing.
>>> 
>>> Maybe something is getting in the way of the panic notifiers and doing
>>> something to prevent the IPMI driver from working.
>>> 
>>> -corey
>>> 
>>>> 
>>>> 
>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <[email protected]> wrote:
>>>>> 
>>>>> Oh, I forgot.  You can look at panic_op in 
>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
>>>>> 
>>>>> -corey
>>>>> 
>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via 
>>>>> Openipmi-developer wrote:
>>>>>> Hi,
>>>>>> 
>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <[email protected]> wrote:
>>>>>>> 
>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via 
>>>>>>> Openipmi-developer wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and am 
>>>>>>>> running out of ideas whether this is a bug or whether something so 
>>>>>>>> subtle has changed in my config that I’m just not seeing it.
>>>>>>>> 
>>>>>>>> (Note: I’m willing to pay for consulting.)
>>>>>>> 
>>>>>>> Probably not necessary.
>>>>>> 
>>>>>> Thanks! The offer always stands. If we should ever meet I’m also able to 
>>>>>> pay in beverages. ;)
>>>>>> 
>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo, (mostly) 
>>>>>>>> vanilla kernel 4.19.157) to a newer setup (NixOS, (mostly) vanilla 
>>>>>>>> kernel 5.10.159) and I’m now experiencing crashes that seem to be 
>>>>>>>> kernel panics but do not get the usual messages in the IPMI SEL.
>>>>>>> 
>>>>>>> I just tested on stock 5.10.159 and it worked without issue.  Everything
>>>>>>> you have below looks ok.
>>>>>>> 
>>>>>>> Can you test by causing a crash with:
>>>>>>> 
>>>>>>> echo c >/proc/sysrq-trigger
>>>>>>> 
>>>>>>> and see if it works?
>>>>>> 
>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
>>>>>> 
>>>>>>> It sounds like you are having some type of crash that you would normally
>>>>>>> use the IPMI logs to debug.  However, they aren't perfect, the system
>>>>>>> has to stay up long enough to get them into the event log.
>>>>>> 
>>>>>> I think they are staying up long enough because a panic triggers the 255 
>>>>>> second bump in the watchdog and only then pass on. However, i’ve also 
>>>>>> noticed that the kernel _should_ be rebooting after a panic much faster 
>>>>>> (and not rely on the watchdog) and that doesn’t happen either. (Sorry 
>>>>>> this just popped from the back of my head).
>>>>>> 
>>>>>>> In this situation, getting a serial console (probably through IPMI
>>>>>>> Serial over LAN) and getting the console output on a crash is probably
>>>>>>> your best option.  You can use ipmitool for this, or I have a library
>>>>>>> that is able to make connections to serial ports, including through IPMI
>>>>>>> SoL.
>>>>>> 
>>>>>> Yup. Been there, too. :)
>>>>>> 
>>>>>> Unfortunately we’re currently chasing something that pops up very 
>>>>>> randomly on somewhat odd machines and I also have the feeling that it’s 
>>>>>> systematically broken right now (as the “echo c” doesn’t work).
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Christian
>>>>>> 
>>>>>> -- 
>>>>>> Christian Theune · [email protected] · +49 345 219401 0
>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
>>>>>> Zagrodnick
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Openipmi-developer mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
>>>> 
>>>> Liebe Grüße,
>>>> Christian Theune
>>>> 
>>>> -- 
>>>> Christian Theune · [email protected] · +49 345 219401 0
>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
>>>> Zagrodnick
>>>> 
>> 
>> Liebe Grüße,
>> Christian Theune
>> 
>> -- 
>> Christian Theune · [email protected] · +49 345 219401 0
>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
>> Zagrodnick
>> 

Liebe Grüße,
Christian Theune

-- 
Christian Theune · [email protected] · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to