On Tue, Oct 03, 2023 at 06:47:49AM +0200, Christian Theune wrote:
> Hey,
> 
> > On 2. Oct 2023, at 17:08, Corey Minyard <[email protected]> wrote:
> > 
> > On Mon, Oct 02, 2023 at 08:05:09AM +0200, Christian Theune wrote:
> > 
> > ...snip...
> > 
> >>>> Can you not get kernel coredumps?
> >>> Unfortunately no and I still have absolutely no idea why the watchdog 
> >>> triggers… I have currently attached dozens of servers that are part of a 
> >>> mysterious series of crashes but they didn’t crash after I attached the 
> >>> SOL continuously. Just my kind of luck I guess … ;)
> >>> 
> >>> It might be a clue.  Can you make sure flow-control is turned off on the 
> >>> SOL connection and console?  If you have "r" on the console= command 
> >>> (like console=ttyS0,115200n8r), the kernel can hang if the BMC stops 
> >>> taking characters.
> >>> 
> >>> You might want to make sure getty has RTS turned off, too.
> >>> 
> >>> The trouble is, of course, that you can lose characters because of a slow 
> >>> BMC.  But it's generally a bad idea to run a console with flow control 
> >>> enabled.
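
[Editorial sketch, not part of the original message.] One way to check a
host for the flow-control setting described above is to scan the kernel
command line for console= entries whose options end in "r". This is only a
minimal sketch: it assumes the option layout given in the kernel's
serial-console documentation (baud, parity, data bits, then an optional "r"
for RTS, e.g. console=ttyS0,115200n8r), and the function name
consoles_with_flow_control is made up for illustration.

```python
import re

# Assumed option layout "bbbbpnf": baud rate, parity (n/o/e), data bits,
# and an optional trailing "r" that enables RTS flow control.
_OPTS = re.compile(r"^(?P<baud>\d+)(?P<parity>[noe])?(?P<bits>\d)?(?P<flow>r)?$")


def consoles_with_flow_control(cmdline):
    """Return the console devices in `cmdline` whose options end in 'r'."""
    flagged = []
    for token in cmdline.split():
        if not token.startswith("console="):
            continue
        # "ttyS1,115200n8r" -> device "ttyS1", opts "115200n8r"
        device, _, opts = token[len("console="):].partition(",")
        match = _OPTS.match(opts)
        if match and match.group("flow"):
            flagged.append(device)
    return flagged


if __name__ == "__main__":
    # On a running Linux host the active command line is in /proc/cmdline.
    with open("/proc/cmdline") as f:
        print(consoles_with_flow_control(f.read()))
```

An empty list means no console= entry requests RTS flow control. It says
nothing about getty or the SOL session itself, which would have to be
checked separately.
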
> >> 
> >> Sorry, that might have been a misunderstanding: I’m not catching the 
> >> crashes currently because all the machines that used to crash now seem to 
> >> not want to crash anymore. I guess we’re on a Heisenbug here. Getting 
> >> output from the SOL works absolutely fine, so I expect to see a kernel 
> >> crash in the SOL once it happens.
> >> 
> >> I’m somewhat suspecting that we’ll find another bug that causes those 
> >> specific crashes to not appear in the SEL, though … 
> >> 
> >> And then again: maybe it’s not a Heisenbug, but maybe whatever caused the 
> >> crashes has been fixed in between and I’ll never know … ;)
> >> 
> > 
> > I understood.  I'm saying that maybe the machines aren't crashing any
> > more *because* you are monitoring them with SOL.
> 
> Oooooooh. I’m glad we took this detour - I knew something was off, but I was 
> the one misunderstanding. Thanks for taking the time to explain it again! I 
> was a bit stuck on the “well it’s a Heisenbug then” but didn’t get that it 
> was literally so… 
> 
> > Perhaps a lot of kernel output comes out all at once, it gets flow
> > controlled by the BMC, the kernel hangs waiting for printk output, and
> > the watchdog then goes off.  Newer kernels have fixes to avoid this
> > problem, but older ones don't.
> > 
> > There would be no OS crash, no SEL output, no coredump, just a watchdog
> > reboot.
>  
> Understood. What would be a newer kernel? We’re running 5.10(.190+) at the 
> moment.
> 
> The interesting part here is that we have been logging to the serial console 
> without anything attached normally for a long long time (think: 10 years 
> plus) so there is still a bit of doubt as this started to creep up only 
> recently.

Yeah, I understand how this would be a strange scenario.  I have seen
this happen in the real world, so it is something that's possible, but I
think the printk changes went in before 5.10.

Maybe a firmware update to the BMC?  I think you would have mentioned
that, though.

Anyway, the only way to know for sure would be to turn off the SOL
monitoring and see if it re-occurs.  I can understand why you wouldn't
want to do that :).

-corey

> 
> > If you turn off the SOL monitoring and the problem comes back, that
> > would be a pretty good indication that something like that is happening.
> > Unfortunately, it's hard to debug because you can't get info from your
> > primary debugging interface.
> 
> Yeah. That’s something I’ll discuss with my team. I originally intended to 
> turn off the continuous SOL monitoring but after this goose chase I’m 
> somewhat willing to make it a regular thing.
> 
> > Of course, the bug may have been fixed by a kernel or app upgrade, too.
> > Like you say with things like this, you may never know :).
> 
> Kernel would be the most obvious choice for us as the affected hosts are 
> really only Qemu/KVM servers that didn’t see any relevant updates in the 
> userland in the past months.
> 
> Thanks again,
> Christian
> 
> -- 
> Christian Theune · [email protected] · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
> 


_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
