Hi,

> On 1. Oct 2023, at 22:59, Corey Minyard <[email protected]> wrote:
>
> On Oct 1, 2023 11:14 AM, Christian Theune via Openipmi-developer
> <[email protected]> wrote:
> Hi,
>
> On 1. Oct 2023, at 03:49, Corey Minyard <[email protected]> wrote:
> >
> > On Sat, Sep 30, 2023 at 11:14:01PM +0200, Christian Theune via
> > Openipmi-developer wrote:
> >> Hi,
> >>
> >> sorry if this isn’t directly a developers question, but I’ve run out of
> >> avenues after googling and looking around…
> >>
> >> We’re experiencing weird system stability issues where the “log to SEL”
> >> doesn’t cut it: we see watchdog reboots but no kernel output whatsoever
> >> ending up in the SEL. (I’ve debugged this with Corey before and we found
> >> something to fix but the watchdog events we’re experiencing still don’t
> >> get logged in more detail.)
> >
> > Can you not get kernel coredumps?
> Unfortunately no and I still have absolutely no idea why the watchdog
> triggers… I have currently attached dozens of servers that are part of a
> mysterious series of crashes but they didn’t crash after I attached the SOL
> continuously. Just my kind of luck I guess … ;)
>
> It might be a clue. Can you make sure flow-control is turned off on the SOL
> connection and console? If you have "r" on the console= command (like
> console=115200n81r), the kernel can hang if the BMC stops taking
> characters.
>
> You might want to make sure getty has RTS turned off, too.
>
> The trouble is, of course, that you can lose characters because of a slow
> BMC. But it's generally a bad idea to run a console with flow control
> enabled.
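For reference, the flow-control check suggested above could be sketched roughly as follows. This is only an illustration, not something from the thread: it assumes the serial console options follow the kernel's documented baud/parity/bits/flow layout (for example console=ttyS1,115200n8r, where the device name ttyS1 is made up here), and the function name consoles_with_flow_control is hypothetical.

```python
import re

# Assumed layout of the serial options after the comma: baud rate,
# parity (n/o/e), data bits, then an optional "r" for RTS flow control.
# The digits group is deliberately lenient so odd values still parse.
_OPTIONS = re.compile(r"^(\d+)([noe]?)(\d*)(r?)$")


def consoles_with_flow_control(cmdline: str) -> list[str]:
    """Return the console= arguments in cmdline that request RTS flow control."""
    flagged = []
    for arg in cmdline.split():
        if not arg.startswith("console="):
            continue
        # Split "ttyS1,115200n8r" into the device and its option string.
        _device, _, options = arg[len("console="):].partition(",")
        match = _OPTIONS.match(options)
        if match and match.group(4) == "r":
            flagged.append(arg)
    return flagged
```

On a running Linux machine the input would typically be the contents of /proc/cmdline, e.g. consoles_with_flow_control(open("/proc/cmdline").read()). An empty result only means that no console= argument asks for flow control; it says nothing about the getty or SOL side.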

Sorry, that might have been a misunderstanding: I’m not catching the crashes currently because all the machines that used to crash now seem to not want to crash anymore. I guess we’re dealing with a Heisenbug here. Getting output from the SOL works absolutely fine, so I expect to see a kernel crash in the SOL once it happens. I’m somewhat suspecting that we’ll find another bug that causes those specific crashes not to appear in the SEL, though …

And then again: maybe it’s not a Heisenbug, but maybe whatever caused the crashes has been fixed in between and I’ll never know … ;)

> As we’re continuously updating our environment it might also be that we’ve
> successfully evaded a kernel bug that was haunting us … maybe … ;)
> >>
> >> I’m wondering: does anyone know of a “push” solution to instruct the BMC
> >> (mostly Supermicro in our case) to push SOL output proactively through
> >> some protocol like syslog?
> >
> > The SEL probably isn't big or fast enough to take system logs. You
> > could create something like this as part of printk, but I suspect that
> > it would quickly overflow the SEL.
> Yeah, I wasn’t thinking about the SEL but wondering whether serial output
> could be shipped in a push-manner from the BMC without having to attach and
> authenticate.
>
> That would take some work in the BMC.

That’s what I thought. Not a promising avenue I guess … I wouldn’t even know who to talk to with any chance of success … ;)

> >> Otherwise we’d need to set up a central host with passwords for dozens of
> >> hosts to pull the SOL for logging and that doesn’t feel right either … -__
> >
> > I know people that do this; it's not terrible. You do have all of your
> > IPMI passwords in one place, that's the biggest issue, but IMHO you
> > should be monitoring the output of your consoles, anyway.
> Yeah, that’s what I’m pondering, too.
> IMHO it’s quite a bit terrible and thus
> I was wondering whether the BMC might have a built-in solution that would
> turn this upside down … but I guess not
> > I support a program called ser2net that is capable of making SOL
> > connections, logging the output, and allowing connections to the
> > console. That would be a pretty complicated setup, but I can help you
> > with it, if you like.
> The multiplexing sounds great. I’ve built a small shell wrapper to manage SOL
> connections and their logging (and reconnecting if the BMC acts up) which
> works for now.
> From a design perspective I’d really love this to be push-based. I researched
> the dmtf site, but didn’t find anything there either … I guess I’m the
> odd-one out then …
>
> No idea. So with your little wrapper connected everything seems to work ok.
>
> Outside of the flow control thing, I have no idea.

Thanks for the input, though! I wasn’t sure whether I was missing something obvious.

I’ll let you know if I should ever find out what’s going on here …

Christian

--
Christian Theune · [email protected] · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
