On Tue, Oct 03, 2023 at 06:47:49AM +0200, Christian Theune wrote:
> Hey,
>
> > On 2. Oct 2023, at 17:08, Corey Minyard <[email protected]> wrote:
> >
> > On Mon, Oct 02, 2023 at 08:05:09AM +0200, Christian Theune wrote:
> >
> > ...snip...
> >
> >>>> Can you not get kernel coredumps?
> >>> Unfortunately no and I still have absolutely now idea why the watchdog
> >>> triggers… I have currently attached dozens of servers that are part of a
> >>> mysterious series of crashes but they didn’t crash after I attached the
> >>> SOL continuously. Just my kind of luck I guess … ;)
> >>>
> >>> It might be a clue. Can you make sure flow-control is turned off on the
> >>> SOL connection and console? If you have "r" on the console= command
> >>> (like console=115200n81r) , if the BMC stops taking characters you can
> >>> hang the kernel.
> >>>
> >>> You might want to make sure getty has RTS turned off, too.
> >>>
> >>> The trouble is, of course, that you can lose characters because of a slow
> >>> BMC. But it's generally a bad idea to run a console with flow control
> >>> enabled.
> >>
> >> Sorry, that might have been a misunderstanding: I’m not catching the
> >> crashes currently because all the machines that used to crash now seem to
> >> not want to crash anymore. I guess we’re on a Heisenbug here. Getting
> >> output from the SOL works absolutely fine, so I expect to see a kernel
> >> crash in the SOL once it happens.
> >>
> >> I’m somewhat suspecting that we’ll find another bug that causes those
> >> specific crashes not appear in the SEL, though …
> >>
> >> And then again: maybe it’s not a Heisenbug, but maybe whatever caused the
> >> crashes has been fixed in between and I’ll never know … ;)
> >>
> >
> > I understood. I'm saying that maybe the machines aren't crashing any
> > more *because* you are monitoring them with SOL.
>
> Oooooooh. I’m glad we took this detour - I knew something was off, but I was
> the one misunderstanding. Thanks for taking the time to explain it again! I
> was a bit stuck on the “well it’s a Heisenbug then” but didn’t get that it
> was literally so…
>
> > Perhaps a lot of kernel output comes out all at once, it gets flow
> > controlled by the BMC, the kernel hangs waiting for printk output, and
> > the watchdog then goes off. Newer kernels have fixes to avoid this
> > problem, but older ones don't.
> >
> > There would be no OS crash, no SEL output, no coredump, just a watchdog
> > reboot.
>
> Understood. What would be a newer kernel? We’re running 5.10(.190+) at the
> moment.
>
> The interesting part here is that we have been logging to the serial console
> without anything attached normally
> for a long long time (think: 10 years plus) so there is still a bit of doubt
> as this started to creep up only recently.

Yeah, I understand how this would be a strange scenario. I have seen this
happen in the real world, so it is something that's possible, but I think
the printk changes went in before 5.10.

Maybe a firmware update to the BMC? I think you would have mentioned that,
though.

Anyway, the only way to know for sure would be to turn off the SOL
monitoring and see if it re-occurs. I can understand why you wouldn't want
to do that :).

-corey

>
> > If you turn off the SOL monitoring and the problem comes back, that
> > would be a pretty good indication that something like that is happening.
> > Unfortunately, it's hard to debug because you can't get info from your
> > primary debugging interface.
>
> Yeah. That’s something I’ll discuss with my team. I originally intended to
> turn off the continuous SOL monitoring but after this goose chase I’m
> somewhat willing to make it a regular thing.
>
> > Of course, the bug may have been fixed by a kernel or app upgrade, too.
> > Like you say with things like this, you may never know :).
>
> Kernel would be the most obvious choice for us as the affected hosts are
> really only Qemu/KVM servers that didn’t see any relevant updates in the
> userland in the past months.
>
> Thanks again,
> Christian
>
> --
> Christian Theune · [email protected] · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
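
For reference, the check for the "r" flow-control flag on console= that
came up earlier in this thread could be sketched roughly as below. This is
only a minimal Python sketch, assuming the kernel's documented
console=<device>,<baud><parity><bits><flow> option format, where a trailing
"r" requests RTS flow control. The function name consoles_with_flow_control
is hypothetical and not part of any existing tool.

```python
import re

# Options part of console=<device>,<options>, e.g. "115200n8r":
# baud rate, optional parity (n/o/e), optional data bits, optional "r" (RTS).
_OPTIONS = re.compile(r"^(\d+)([noe])?(\d)?(r)?$")


def consoles_with_flow_control(cmdline):
    """Return the console devices whose options end in the 'r' flow-control flag."""
    flagged = []
    for token in cmdline.split():
        if not token.startswith("console="):
            continue
        # Split "ttyS1,115200n8r" into device "ttyS1" and options "115200n8r".
        device, _, options = token[len("console="):].partition(",")
        match = _OPTIONS.match(options)
        if match and match.group(4):
            flagged.append(device)
    return flagged
```

On a running Linux host the input would normally be the contents of
/proc/cmdline, for example
consoles_with_flow_control(open("/proc/cmdline").read()). An empty result
means no console= entry asks for RTS flow control. It says nothing about
the getty or SOL side, which would have to be checked separately.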
