On Mon, Oct 02, 2023 at 08:05:09AM +0200, Christian Theune wrote:

...snip...

> > > Can you not get kernel coredumps?
> > Unfortunately no and I still have absolutely no idea why the watchdog
> > triggers… I have currently attached dozens of servers that are part of a
> > mysterious series of crashes but they didn’t crash after I attached the SOL
> > continuously. Just my kind of luck I guess … ;)
> >
> > It might be a clue. Can you make sure flow-control is turned off on the
> > SOL connection and console? If you have "r" on the console= command (like
> > console=115200n81r), if the BMC stops taking characters you can hang the
> > kernel.
> >
> > You might want to make sure getty has RTS turned off, too.
> >
> > The trouble is, of course, that you can lose characters because of a slow
> > BMC. But it's generally a bad idea to run a console with flow control
> > enabled.
>
> Sorry, that might have been a misunderstanding: I’m not catching the crashes
> currently because all the machines that used to crash now seem to not want to
> crash anymore. I guess we’re on a Heisenbug here. Getting output from the SOL
> works absolutely fine, so I expect to see a kernel crash in the SOL once it
> happens.
>
> I’m somewhat suspecting that we’ll find another bug that causes those
> specific crashes not to appear in the SEL, though …
>
> And then again: maybe it’s not a Heisenbug, but maybe whatever caused the
> crashes has been fixed in between and I’ll never know … ;)
>

I understood. I'm saying that maybe the machines aren't crashing any
more *because* you are monitoring them with SOL. Perhaps a lot of
kernel output comes out all at once, it gets flow controlled by the BMC,
the kernel hangs waiting for printk output, and the watchdog then goes
off. Newer kernels have fixes to avoid this problem, but older ones
don't. There would be no OS crash, no SEL output, no coredump, just a
watchdog reboot.

If you turn off the SOL monitoring and the problem comes back, that
would be a pretty good indication that something like that is happening.
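One way to check for the "r" flow-control flag on console= discussed above is a small script along these lines. This is only a sketch in Python under stated assumptions: the helper name consoles_with_flow_control is made up for illustration, and the option layout (baud, parity, data bits, optional trailing "r" for RTS) is assumed from the kernel's serial console documentation. The pattern also tolerates an extra digit so that the "115200n81r" spelling used in this thread is recognized.

```python
import re

# Assumed serial option layout: baud, parity (n/o/e), data bits, and an
# optional trailing "r" for RTS flow control. The extra optional digit
# tolerates the "n81r" spelling quoted in this thread.
_SERIAL_OPTS = re.compile(r"^\d+[noe]?\d?\d?(?P<flow>r)?$")


def consoles_with_flow_control(cmdline):
    """Return the console= tokens in cmdline that request RTS flow control.

    Hypothetical helper for illustration; cmdline is a kernel command
    line string such as the contents of /proc/cmdline.
    """
    flagged = []
    for token in cmdline.split():
        if not token.startswith("console="):
            continue
        value = token[len("console="):]
        # The options are whatever follows the last comma; a value with
        # no comma is either a bare device (tty0) or bare options.
        _device, _sep, opts = value.rpartition(",")
        match = _SERIAL_OPTS.match(opts)
        if match and match.group("flow"):
            flagged.append(token)
    return flagged
```

On a running Linux system the input would typically be the contents of /proc/cmdline, for example consoles_with_flow_control(open("/proc/cmdline").read()). An empty list only says that no console= entry asks for flow control; it says nothing about how getty or the SOL side is configured.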
Unfortunately, it's hard to debug because you can't get info from your
primary debugging interface.

Of course, the bug may have been fixed by a kernel or app upgrade, too.
Like you say with things like this, you may never know :).

-corey
