Thanks for fixing it!

> On 15. Mar 2023, at 13:20, Corey Minyard <miny...@acm.org> wrote:
> 
> On Wed, Mar 15, 2023 at 01:12:05PM +0100, Christian Theune via 
> Openipmi-developer wrote:
>> Ah, fantastic! That explains it of course … :)
>> 
>> From my side I guess this works and I don’t have to retry with that, but I’d 
>> be happy to just wait for 5.10.175 … or would you prefer me explicitly 
>> testing your original?
> 
> We can just wait.  The problem is obvious now, and the backports are in
> progress.
> 
> Thanks for helping me with this.
> 
> -corey
> 
>> 
>> Christian
>> 
>>> On 15. Mar 2023, at 13:07, Corey Minyard <miny...@acm.org> wrote:
>>> 
>>> On Wed, Mar 15, 2023 at 07:32:41AM +0100, Christian Theune via 
>>> Openipmi-developer wrote:
>>>> Hi,
>>>> 
>>>> that didn’t apply on 5.10. Here’s what I’m currently trying to build after 
>>>> manually inspecting the rejected patch:
>>>> 
>>> 
>>> Well, I guess I should have sent the prerequisite patch, too.  Here it
>>> is:
>>> 
>>> a01a89b1db ("ipmi/watchdog: replace atomic_add() and atomic_sub()")
>>> 
>>> Also attached.
>>> 
>>> -corey
>>> 
>>>> 
>>>> 
>>>>> On 14. Mar 2023, at 18:29, Corey Minyard <miny...@acm.org> wrote:
>>>>> 
>>>>> Well, dang, I had already fixed this a year and a half ago.  I wish I
>>>>> had a better memory.
>>>>> 
>>>>> Anyway, the fix is commit db05ddf7f321634c5659a0cf7ea56594e22365f7
>>>>> ("ipmi:watchdog: Set panic count to proper value on a panic") in
>>>>> mainstream 5.16.  I'm attaching that patch.
>>>>> 
>>>>> -corey
>>>>> 
>>>>> On Tue, Mar 14, 2023 at 03:58:26PM +0100, Christian Theune via 
>>>>> Openipmi-developer wrote:
>>>>>> Awesome!
>>>>>> 
>>>>>>> On 14. Mar 2023, at 15:54, Corey Minyard <miny...@acm.org> wrote:
>>>>>>> 
>>>>>>> On Tue, Mar 14, 2023 at 03:22:39PM +0100, Christian Theune via 
>>>>>>> Openipmi-developer wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> sorry, I didn’t expect you to make me a branch. I had already taken 
>>>>>>>> your diff over to 5.10 as it applied cleanly … sorry for the 
>>>>>>>> additional work and thanks anyways.
>>>>>>> 
>>>>>>> Ok, that's great.  It's something about the IPMI watchdog panic
>>>>>>> routines, and I can reproduce.  I should be able to fix this pretty
>>>>>>> quickly.  I'll send a patch when I get this fixed.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> -corey
>>>>>>> 
>>>>>>>> 
>>>>>>>> Here’s the output:
>>>>>>>> 
>>>>>>>> [ 6521.905890] sysrq: Trigger a crash
>>>>>>>> [ 6521.909294] Kernel panic - not syncing: sysrq triggered crash
>>>>>>>> [ 6521.915026] CPU: 1 PID: 43785 Comm: bash Tainted: G          I       5.10.159 #1-NixOS
>>>>>>>> [ 6521.922925] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
>>>>>>>> [ 6521.930475] Call Trace:
>>>>>>>> [ 6521.932923]  dump_stack+0x6b/0x83
>>>>>>>> [ 6521.936230]  panic+0x101/0x2c8
>>>>>>>> [ 6521.939276]  ? printk+0x58/0x73
>>>>>>>> [ 6521.942408]  sysrq_handle_crash+0x16/0x20
>>>>>>>> [ 6521.946407]  __handle_sysrq.cold+0x43/0x11a
>>>>>>>> [ 6521.950580]  write_sysrq_trigger+0x24/0x40
>>>>>>>> [ 6521.954668]  proc_reg_write+0x51/0x90
>>>>>>>> [ 6521.958322]  vfs_write+0xc3/0x280
>>>>>>>> [ 6521.961627]  ksys_write+0x5f/0xe0
>>>>>>>> [ 6521.964935]  do_syscall_64+0x33/0x40
>>>>>>>> [ 6521.968502]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
>>>>>>>> [ 6521.973540] RIP: 0033:0x7f2c6b91a133
>>>>>>>> [ 6521.977106] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
>>>>>>>> [ 6521.995836] RSP: 002b:00007ffc4cf11088 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>>>>>>> [ 6522.003387] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f2c6b91a133
>>>>>>>> [ 6522.010505] RDX: 0000000000000002 RSI: 0000000001555c08 RDI: 0000000000000001
>>>>>>>> [ 6522.017623] RBP: 0000000001555c08 R08: 000000000000000a R09: 00007f2c6b9aaf40
>>>>>>>> [ 6522.024743] R10: 00000000016e4218 R11: 0000000000000246 R12: 0000000000000002
>>>>>>>> [ 6522.031864] R13: 00007f2c6b9e8520 R14: 00007f2c6b9e8720 R15: 0000000000000002
>>>>>>>> [ 6522.039085] Calling notifier panic_event+0x0/0x410 [ipmi_msghandler] (000000008eb8cb44)
>>>>>>>> [ 6522.047071] IPMI message handler: IPMI: panic event handler
>>>>>>>> [ 6522.052628] IPMI message handler: IPMI: handling panic event for intf 0: 00000000443777b3 0000000067d05ff8
>>>>>>>> …
>>>>>>>> and then it reboots after the 255 seconds of the watchdog timer have 
>>>>>>>> passed.
>>>>>>>> 
>>>>>>>> Christian
>>>>>>>> 
>>>>>>>>> On 13. Mar 2023, at 18:13, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 13, 2023 at 05:42:39PM +0100, Christian Theune wrote:
>>>>>>>>>> Hrghs. I’m applying your patch to 5.10 as my distro build 
>>>>>>>>>> infrastructure has some patches that don’t apply to 6.2 and that I 
>>>>>>>>>> don’t know how to circumvent quickly enough… :)
>>>>>>>>> 
>>>>>>>>> Ok, there's a
>>>>>>>>> 
>>>>>>>>> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events-5.10
>>>>>>>>> 
>>>>>>>>> branch available for you to pull.  It's on top of latest 5.10.
>>>>>>>>> 
>>>>>>>>> -corey
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 13. Mar 2023, at 16:59, Christian Theune <c...@flyingcircus.io> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I should be easily able to run 6.2, no worries.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 13. Mar 2023, at 16:33, Corey Minyard <miny...@acm.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Mar 13, 2023 at 02:07:01PM +0100, Christian Theune wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> yeah, the IPMI log is fine. This is a 10 minute interval job in 
>>>>>>>>>>>>> our system that exports the log and clears it:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The job looks like this:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel elist
>>>>>>>>>>>>> /nix/store/m7lb36dr93qj27r9vskmjihz8imywy86-ipmitool-1.8.18/bin/ipmitool sel clear
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So it’s not atomic but it runs after the boot and the elist 
>>>>>>>>>>>>> should output it properly … at least it did in the past. ;)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As I said - I’m happy to run any patches you have. If you point 
>>>>>>>>>>>>> me to a git branch somewhere I can switch that system easily.
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, I have a branch at
>>>>>>>>>>>> 
>>>>>>>>>>>> https://github.com/cminyard/linux-ipmi.git:debug-panic-oem-events
>>>>>>>>>>>> 
>>>>>>>>>>>> that has debug tracing.  It prints the function for all panic event
>>>>>>>>>>>> handlers and their return values, and adds traces in the IPMI panic
>>>>>>>>>>>> event handlers.
>>>>>>>>>>>> 
>>>>>>>>>>>> It's a single patch right on top of 6.2; I'm not sure how portable it
>>>>>>>>>>>> is to other kernel versions.  I can port it if you like.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> -corey
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Christian
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 13. Mar 2023, at 13:58, Corey Minyard <miny...@acm.org> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Mar 13, 2023 at 10:27:51AM +0100, Christian Theune wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> alright, so here’s the output from the NixOS machine:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> root@xxx ~ # echo c >/proc/sysrq-trigger
>>>>>>>>>>>>>>> client_loop: send disconnect: Broken pipe
>>>>>>>>>>>>>>> …
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> root@xxx ~ # journalctl -u ipmi-log.service
>>>>>>>>>>>>>>> -- Journal begins at Sun 2023-02-26 14:25:36 CET, ends at Mon 2023-03-13 10:25:27 CET. --
>>>>>>>>>>>>>>> Mar 13 10:12:38 xxx ipmi-log-start[520973]: Clearing SEL.  Please allow a few seconds to erase.
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> -- Boot fdef496e784e4541abd9ae40df472a0b --
>>>>>>>>>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    1 | 03/13/2023 | 09:12:49 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
>>>>>>>>>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1973]:    2 | 03/13/2023 | 09:21:06 | Watchdog2 OS Watchdog | Hard reset | Asserted
>>>>>>>>>>>>>>> Mar 13 10:25:07 xxx ipmi-log-start[1977]: Clearing SEL.  Please allow a few seconds to erase.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hmm, the SEL got cleared.  That would clear out any of the logs that
>>>>>>>>>>>>>> were issued before that time.  I'm not sure when the above happened
>>>>>>>>>>>>>> versus the crash, though.  It looks like it occurred as part of the
>>>>>>>>>>>>>> reboot, but I'm not sure what I'm seeing.  Maybe you have a startup
>>>>>>>>>>>>>> process that clears the SEL?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Assuming that's not the issue, what you have looks ok.  I'd need to
>>>>>>>>>>>>>> add some logs to the kernel to see if the log operation ever happens.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -corey
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The SOL log looks like this:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1107585.917689] sysrq: Trigger a crash
>>>>>>>>>>>>>>> [1107585.921272] Kernel panic - not syncing: sysrq triggered crash
>>>>>>>>>>>>>>> [1107585.927178] CPU: 1 PID: 521033 Comm: bash Tainted: G          I 5.10.159 #1-NixOS
>>>>>>>>>>>>>>> [1107585.935335] Hardware name: Dell Inc. PowerEdge R510/00HDP0, BIOS 1.11.0 07/23/2012
>>>>>>>>>>>>>>> [1107585.943058] Call Trace:
>>>>>>>>>>>>>>> [1107585.945680]  dump_stack+0x6b/0x83
>>>>>>>>>>>>>>> [1107585.949158]  panic+0x101/0x2c8
>>>>>>>>>>>>>>> [1107585.952379]  ? printk+0x58/0x73
>>>>>>>>>>>>>>> [1107585.955687]  sysrq_handle_crash+0x16/0x20
>>>>>>>>>>>>>>> [1107585.959859]  __handle_sysrq.cold+0x43/0x11a
>>>>>>>>>>>>>>> [1107585.964203]  write_sysrq_trigger+0x24/0x40
>>>>>>>>>>>>>>> [1107585.968463]  proc_reg_write+0x51/0x90
>>>>>>>>>>>>>>> [1107585.972290]  vfs_write+0xc3/0x280
>>>>>>>>>>>>>>> [1107585.975768]  ksys_write+0x5f/0xe0
>>>>>>>>>>>>>>> [1107585.979248]  do_syscall_64+0x33/0x40
>>>>>>>>>>>>>>> [1107585.982987]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
>>>>>>>>>>>>>>> [1107585.988199] RIP: 0033:0x7f5873932133
>>>>>>>>>>>>>>> [1107585.991938] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 41 54 49 89 d4 55 48 89 f5
>>>>>>>>>>>>>>> [1107586.010842] RSP: 002b:00007ffcc13808c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>>>>>>>>>>>>>> [1107586.018566] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5873932133
>>>>>>>>>>>>>>> [1107586.025923] RDX: 0000000000000002 RSI: 00000000005c1c08 RDI: 0000000000000001
>>>>>>>>>>>>>>> [1107586.033213] RBP: 00000000005c1c08 R08: 000000000000000a R09: 00007f58739c2f40
>>>>>>>>>>>>>>> [1107586.040504] R10: 00000000005cc348 R11: 0000000000000246 R12: 0000000000000002
>>>>>>>>>>>>>>> [1107586.047794] R13: 00007f5873a00520 R14: 00007f5873a00720 R15: 0000000000000002
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Nothing obvious to me here … if you have any further ideas what 
>>>>>>>>>>>>>>> to test, let me know. I should be more responsive again now.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks and kind regards,
>>>>>>>>>>>>>>> Christian
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 5. Mar 2023, at 23:53, Corey Minyard <miny...@acm.org> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Mar 01, 2023 at 06:00:07PM +0100, Christian Theune 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> I’m going to actually attach a serial console to watch the 
>>>>>>>>>>>>>>>>> “echo c” panic, maybe that gives _some_ indication.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Otherwise: I can quickly run patches on the kernel there to 
>>>>>>>>>>>>>>>>> try out things. (And the funding offer still stands.)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Any news on this?  I'm curious what this could be.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -corey
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Christian
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 1. Mar 2023, at 17:58, Corey Minyard <miny...@acm.org> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 06:36:17PM +0100, Christian Theune 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> Thanks, both machines report:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> # cat /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>>>>>>>>>>>>> string
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> At this point, I have no idea.  I'd have to start adding printks into
>>>>>>>>>>>>>>>>>> the code and cause crashes to see what is happening.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Maybe something is getting in the way of the panic notifiers and doing
>>>>>>>>>>>>>>>>>> something to prevent the IPMI driver from working.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -corey
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 28. Feb 2023, at 18:04, Corey Minyard <miny...@acm.org> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Oh, I forgot.  You can look at panic_op in 
>>>>>>>>>>>>>>>>>>>> /sys/module/ipmi_msghandler/parameters/panic_op
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> -corey
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 05:48:07PM +0100, Christian Theune 
>>>>>>>>>>>>>>>>>>>> via Openipmi-developer wrote:
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 28. Feb 2023, at 17:36, Corey Minyard 
>>>>>>>>>>>>>>>>>>>>>> <miny...@acm.org> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 28, 2023 at 02:53:12PM +0100, Christian 
>>>>>>>>>>>>>>>>>>>>>> Theune via Openipmi-developer wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I’ve been trying to debug the PANIC and OEM string 
>>>>>>>>>>>>>>>>>>>>>>> handling and am running out of ideas whether this is a 
>>>>>>>>>>>>>>>>>>>>>>> bug or whether something so subtle has changed in my 
>>>>>>>>>>>>>>>>>>>>>>> config that I’m just not seeing it.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> (Note: I’m willing to pay for consulting.)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Probably not necessary.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks! The offer always stands. If we should ever meet 
>>>>>>>>>>>>>>>>>>>>> I’m also able to pay in beverages. ;)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I have machines that we’ve moved from an older setup 
>>>>>>>>>>>>>>>>>>>>>>> (Gentoo, (mostly) vanilla kernel 4.19.157) to a newer 
>>>>>>>>>>>>>>>>>>>>>>> setup (NixOS, (mostly) vanilla kernel 5.10.159) and I’m 
>>>>>>>>>>>>>>>>>>>>>>> now experiencing crashes that seem to be kernel panics 
>>>>>>>>>>>>>>>>>>>>>>> but do not get the usual messages in the IPMI SEL.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I just tested on stock 5.10.159 and it worked without issue.
>>>>>>>>>>>>>>>>>>>>>> Everything you have below looks ok.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Can you test by causing a crash with:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> echo c >/proc/sysrq-trigger
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> and see if it works?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Yeah, already tried that and unfortunately that _doesn’t_ 
>>>>>>>>>>>>>>>>>>>>> work.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> It sounds like you are having some type of crash that you would
>>>>>>>>>>>>>>>>>>>>>> normally use the IPMI logs to debug.  However, they aren't perfect;
>>>>>>>>>>>>>>>>>>>>>> the system has to stay up long enough to get them into the event log.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I think they are staying up long enough because a panic 
>>>>>>>>>>>>>>>>>>>>> triggers the 255 second bump in the watchdog and only 
>>>>>>>>>>>>>>>>>>>>> then pass on. However, I’ve also noticed that the kernel 
>>>>>>>>>>>>>>>>>>>>> _should_ be rebooting after a panic much faster (and not 
>>>>>>>>>>>>>>>>>>>>> rely on the watchdog) and that doesn’t happen either. 
>>>>>>>>>>>>>>>>>>>>> (Sorry this just popped from the back of my head).
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In this situation, getting a serial console (probably through IPMI
>>>>>>>>>>>>>>>>>>>>>> Serial over LAN) and getting the console output on a crash is probably
>>>>>>>>>>>>>>>>>>>>>> your best option.  You can use ipmitool for this, or I have a library
>>>>>>>>>>>>>>>>>>>>>> that is able to make connections to serial ports, including through
>>>>>>>>>>>>>>>>>>>>>> IPMI SoL.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Yup. Been there, too. :)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Unfortunately we’re currently chasing something that pops 
>>>>>>>>>>>>>>>>>>>>> up very randomly on somewhat odd machines and I also have 
>>>>>>>>>>>>>>>>>>>>> the feeling that it’s systematically broken right now (as 
>>>>>>>>>>>>>>>>>>>>> the “echo c” doesn’t work).
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>>>>>>>>>> Christian
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
>>>>>>>>>>>>>>>>>>>>> Flying Circus Internet Operations GmbH · 
>>>>>>>>>>>>>>>>>>>>> https://flyingcircus.io
>>>>>>>>>>>>>>>>>>>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>>>>>>>>>>>>>>>>>>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, 
>>>>>>>>>>>>>>>>>>>>> Christian Zagrodnick
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>> Openipmi-developer mailing list
>>>>>>>>>>>>>>>>>>>>> Openipmi-developer@lists.sourceforge.net
>>>>>>>>>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/openipmi-developer
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> <0001-ipmi-watchdog-Set-panic-count-to-proper-value-on-a-p.patch>
>>>> 
>>> 
>>> 
>>> 
>>> <0001-ipmi-watchdog-replace-atomic_add-and-atomic_sub.patch>
>> 


Kind regards,
Christian Theune

-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
