Hey,

> On 2. Oct 2023, at 17:08, Corey Minyard <[email protected]> wrote:
>
> On Mon, Oct 02, 2023 at 08:05:09AM +0200, Christian Theune wrote:
>
> ...snip...
>
>>>> Can you not get kernel coredumps?
>>> Unfortunately no and I still have absolutely no idea why the watchdog
>>> triggers… I have currently attached dozens of servers that are part of a
>>> mysterious series of crashes but they didn’t crash after I attached the SOL
>>> continuously. Just my kind of luck I guess … ;)
>>>
>>> It might be a clue. Can you make sure flow-control is turned off on the
>>> SOL connection and console? If you have "r" on the console= command (like
>>> console=115200n81r) , if the BMC stops taking characters you can hang the
>>> kernel.
>>>
>>> You might want to make sure getty has RTS turned off, too.
>>>
>>> The trouble is, of course, that you can lose characters because of a slow
>>> BMC. But it's generally a bad idea to run a console with flow control
>>> enabled.
>>
>> Sorry, that might have been a misunderstanding: I’m not catching the crashes
>> currently because all the machines that used to crash now seem to not want
>> to crash anymore. I guess we’re on a Heisenbug here. Getting output from the
>> SOL works absolutely fine, so I expect to see a kernel crash in the SOL once
>> it happens.
>>
>> I’m somewhat suspecting that we’ll find another bug that causes those
>> specific crashes not appear in the SEL, though …
>>
>> And then again: maybe it’s not a Heisenbug, but maybe whatever caused the
>> crashes has been fixed in between and I’ll never know … ;)
>>
>
> I understood. I'm saying that maybe the machines aren't crashing any
> more *because* you are monitoring them with SOL.

Oooooooh. I’m glad we took this detour - I knew something was off, but I was the one misunderstanding. Thanks for taking the time to explain it again! I was a bit stuck on the “well it’s a Heisenbug then” but didn’t get that it was literally so…

> Perhaps a lot of kernel output comes out all at once, it gets flow
> controlled by the BMC, the kernel hangs waiting for printk output, and
> the watchdog then goes off. Newer kernels have fixes to avoid this
> problem, but older ones don't.
>
> There would be no OS crash, no SEL output, no coredump, just a watchdog
> reboot.

Understood. What would be a newer kernel? We’re running 5.10(.190+) at the moment.

The interesting part here is that we have been logging to the serial console, normally without anything attached, for a long long time (think: 10 years plus), so there is still a bit of doubt, as this started to crop up only recently.

> If you turn off the SOL monitoring and the problem comes back, that
> would be a pretty good indication that something like that is happening.
> Unfortunately, it's hard to debug because you can't get info from your
> primary debugging interface.

Yeah. That’s something I’ll discuss with my team. I originally intended to turn off the continuous SOL monitoring but after this goose chase I’m somewhat willing to make it a regular thing.

> Of course, the bug may have been fixed by a kernel or app upgrade, too.
> Like you say with things like this, you may never know :).

The kernel would be the most obvious candidate for us, as the affected hosts are really only QEMU/KVM servers that didn’t see any relevant userland updates in the past months.

Thanks again,
Christian

--
Christian Theune · [email protected] · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
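
PS: For checking the flow-control question across a fleet, here is a minimal sketch (not something from this thread) that flags any console= kernel parameter carrying the "r" flag. It assumes the option format described in the kernel's serial console documentation, i.e. device,BBBBPNF such as console=ttyS1,115200n8r; the function name consoles_with_flow_control is made up for illustration.

```python
import re

# Assumed option layout (kernel serial-console docs): BBBBPNF
#   BBBB = baud rate, P = parity (n/o/e), N = data bits,
#   F = flow control, where "r" requests RTS.
_OPTS = re.compile(r"^(\d+)([noe])?(\d)?(r)?$")


def consoles_with_flow_control(cmdline):
    """Return the console= devices whose options request RTS flow control.

    Hypothetical helper for illustration; `cmdline` is the kernel command
    line as a single string.
    """
    flagged = []
    for token in cmdline.split():
        if not token.startswith("console="):
            continue
        # Split "ttyS1,115200n8r" into device and option string.
        device, _, opts = token[len("console="):].partition(",")
        match = _OPTS.match(opts)
        if match and match.group(4):
            flagged.append(device)
    return flagged
```

On a running host the input would be the contents of /proc/cmdline; any device it returns is a console that can block when the BMC stops taking characters. It does not cover the getty side, which would need a separate look at the tty settings.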
