Hi,

yeah, the IPMI log is fine. We have a job that runs every 10 minutes; it
exports the log and then clears it:

/nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel elist
/nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel clear

So it’s not atomic, but it runs after boot, and the elist should output the
log properly … at least it did in the past. ;)
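
A minimal sketch of how the two steps could be chained so that the clear only
runs after the export succeeded. This is an illustration, not our actual job:
the function name export_and_clear_sel and the IPMITOOL variable are made up
for the example, and it is still not atomic, since events logged between the
two calls would be lost.

```shell
#!/bin/sh
# Sketch only: IPMITOOL and export_and_clear_sel are hypothetical names.
# IPMITOOL defaults to whatever "ipmitool" is found on PATH.
IPMITOOL="${IPMITOOL:-ipmitool}"

export_and_clear_sel() {
    # Capture the extended listing first; if that fails, leave the SEL alone.
    entries="$("$IPMITOOL" sel elist)" || return 1

    # Print the captured entries on stdout so the service manager can log them.
    if [ -n "$entries" ]; then
        printf '%s\n' "$entries"
    fi

    # Only now clear the SEL. Events written between the two calls are lost.
    "$IPMITOOL" sel clear
}
```

The job would then just call export_and_clear_sel and exit with its status, so
a failed export shows up as a failed run instead of a silently cleared log.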

As I said - I’m happy to run any patches you have. If you point me to a git 
branch somewhere I can switch that system easily.
 
Cheers,
Christian

> On 13. Mar 2023, at 13:58, Corey Minyard <[email protected]> wrote:
> 
> On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
>> Hi,
>> 
>> alright, so here’s the output from the NixOS machine:
>> 
>> root@xxx ~ # echo c >/proc/sysrq-trigger
>> client_loop: send disconnect: Broken pipe
>> …
>> 
>> root@xxx ~ # journalctl -u ipmi-log.service
>> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
>> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL.  Please allow a few seconds to erase.
>> ...
>> -- Boot fdef496e784e4541abd9ae40df472a0b --
>> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
>> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
>> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL.  Please allow a few seconds to erase.
> 
> Hmm, the SEL got cleared.  That would clear out any of the logs that
> were issued before that time.  I'm not sure when the above happened
> versus the crash, though.  It looks like it occurred as part of the
> reboot, but I'm not sure what I'm seeing.  Maybe you have a startup
> process that clears the SEL?
> 
> Assuming that's not the issue, what you have looks ok.  I'd need to add
> some logs to the kernel to see if the log operation ever happens.
> 
> -corey
> 
>> 
>> The SOL log looks like this:
>> 
>> 
>> [1107585.917689] sysrq: Trigger a crash
>> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
>> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G          I       5.10.159 #1-NixOS
>> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
>> [1107585.943058] Call Trace:
>> [1107585.945680]  dump_stack+0x6b/0x83
>> [1107585.949158]  panic+0x101/0x2c8
>> [1107585.952379]  ? printk+0x58/0x73
>> [1107585.955687]  sysrq_handle_crash+0x16/0x20
>> [1107585.959859]  __handle_sysrq.cold+0x43/0x11a
>> [1107585.964203]  write_sysrq_trigger+0x24/0x40
>> [1107585.968463]  proc_reg_write+0x51/0x90
>> [1107585.972290]  vfs_write+0xc3/0x280
>> [1107585.975768]  ksys_write+0x5f/0xe0
>> [1107585.979248]  do_syscall_64+0x33/0x40
>> [1107585.982987]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
>> [1107585.988199] RIP: 0033:0x7f5873932133
>> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
>> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
>> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
>> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
>> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
>> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002
>> 
>> Nothing obvious to me here … if you have any further ideas what to test, let 
>> me know. I should be more responsive again now.
>> 
>> Thanks and kind regards,
>> Christian
>> 
>>> On 5. Mar 2023, at 23:53, Corey Minyard <[email protected]> wrote:
>>> 
>>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune wrote:
>>>> I’m going to actually attach a serial console to watch the “echo c” panic, 
>>>> maybe that gives _some_ indication.
>>>> 
>>>> Otherwise: I can quickly run patches on the kernel there to try out 
>>>> things. (And the funding offer still stands.)
>>> 
>>> Any news on this?  I'm curious what this could be.
>>> 
>>> -corey
>>> 
>>>> 
>>>> Christian
>>>> 
>>>>> On 1. Mar 2023, at 17:58, Corey Minyard <[email protected]> wrote:
>>>>> 
>>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune wrote:
>>>>>> Thanks, both machines report:
>>>>>> 
>>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>> string
>>>>> 
>>>>> At this point, I have no idea.  I'd have to start adding printks into
>>>>> the code and cause crashes to see what is happening.
>>>>> 
>>>>> Maybe something is getting in the way of the panic notifiers and doing
>>>>> something to prevent the IPMI driver from working.
>>>>> 
>>>>> -corey
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <[email protected]> wrote:
>>>>>>> 
>>>>>>> Oh, I forgot.  You can look at panic_op in 
>>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>> 
>>>>>>> -corey
>>>>>>> 
>>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune via 
>>>>>>> Openipmi-developer wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian Theune via 
>>>>>>>>> Openipmi-developer wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I’ve been trying to debug the PANIC and OEM string handling and am 
>>>>>>>>>> running out of ideas whether this is a bug or whether something so 
>>>>>>>>>> subtle has changed in my config that I’m just not seeing it.
>>>>>>>>>> 
>>>>>>>>>> (Note: I’m willing to pay for consulting.)
>>>>>>>>> 
>>>>>>>>> Probably not necessary.
>>>>>>>> 
>>>>>>>> Thanks! The offer always stands. If we should ever meet I’m also able 
>>>>>>>> to pay in beverages. ;)
>>>>>>>> 
>>>>>>>>>> I have machines that we’ve moved from an older setup (Gentoo, 
>>>>>>>>>> (mostly) vanilla kernel 4.19.157) to a newer setup (NixOS, (mostly) 
>>>>>>>>>> vanilla kernel 5.10.159) and I’m now experiencing crashes that seem 
>>>>>>>>>> to be kernel panics but do not get the usual messages in the IPMI 
>>>>>>>>>> SEL.
>>>>>>>>> 
>>>>>>>>> I just tested on stock 5.10.159 and it worked without issue.  Everything
>>>>>>>>> you have below looks ok.
>>>>>>>>> 
>>>>>>>>> Can you test by causing a crash with:
>>>>>>>>> 
>>>>>>>>> echo c >/proc/sysrq-trigger
>>>>>>>>> 
>>>>>>>>> and see if it works?
>>>>>>>> 
>>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ work.
>>>>>>>> 
>>>>>>>>> It sounds like you are having some type of crash that you would normally
>>>>>>>>> use the IPMI logs to debug.  However, they aren't perfect; the system
>>>>>>>>> has to stay up long enough to get them into the event log.
>>>>>>>> 
>>>>>>>> I think they are staying up long enough, because a panic triggers the 
>>>>>>>> 255-second bump in the watchdog and only then does the machine reset. 
>>>>>>>> However, I’ve also noticed that the kernel _should_ be rebooting after 
>>>>>>>> a panic much faster (and not rely on the watchdog), and that doesn’t 
>>>>>>>> happen either. (Sorry, this just popped up from the back of my head.)
>>>>>>>> 
>>>>>>>>> In this situation, getting a serial console (probably through IPMI
>>>>>>>>> Serial over LAN) and getting the console output on a crash is probably
>>>>>>>>> your best option.  You can use ipmitool for this, or I have a library
>>>>>>>>> that is able to make connections to serial ports, including through IPMI
>>>>>>>>> SoL.
>>>>>>>> 
>>>>>>>> Yup. Been there, too. :)
>>>>>>>> 
>>>>>>>> Unfortunately we’re currently chasing something that pops up very 
>>>>>>>> randomly on somewhat odd machines and I also have the feeling that 
>>>>>>>> it’s systematically broken right now (as the “echo c” doesn’t work).
>>>>>>>> 
>>>>>>>> Thanks a lot,
>>>>>>>> Christian
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Christian Theune · [email protected] · +49 345 219401 0
>>>>>>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
>>>>>>>> Zagrodnick
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> Openipmi-developer mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
>>>>>> 
>>>> 
>> 


Kind regards,
Christian Theune

-- 
Christian Theune · [email protected] · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
