Hey,

> On 2. Oct 2023, at 17:08, Corey Minyard <[email protected]> wrote:
> 
> On Mon, Oct 02, 2023 at 08:05:09AM +0200, Christian Theune wrote:
> 
> ...snip...
> 
>>>> Can you not get kernel coredumps?
>>> Unfortunately no and I still have absolutely no idea why the watchdog 
>>> triggers… I have currently attached dozens of servers that are part of a 
>>> mysterious series of crashes but they didn’t crash after I attached the SOL 
>>> continuously. Just my kind of luck I guess … ;)
>>> 
>>> It might be a clue.  Can you make sure flow-control is turned off on the 
>>> SOL connection and console?  If you have "r" on the console= command (like 
>>> console=115200n81r) , if the BMC stops taking characters you can hang the 
>>> kernel.
>>> 
>>> You might want to make sure getty has RTS turned off, too.
>>> 
>>> The trouble is, of course, that you can lose characters because of a slow 
>>> BMC.  But it's generally a bad idea to run a console with flow control 
>>> enabled.
>> 
>> Sorry, that might have been a misunderstanding: I’m not catching the crashes 
>> currently because all the machines that used to crash now seem to not want 
>> to crash anymore. I guess we’re on a Heisenbug here. Getting output from the 
>> SOL works absolutely fine, so I expect to see a kernel crash in the SOL once 
>> it happens.
>> 
>> I’m somewhat suspecting that we’ll find another bug that causes those 
>> specific crashes not to appear in the SEL, though … 
>> 
>> And then again: maybe it’s not a Heisenbug, but maybe whatever caused the 
>> crashes has been fixed in between and I’ll never know … ;)
>> 
> 
> I understood.  I'm saying that maybe the machines aren't crashing any
> more *because* you are monitoring them with SOL.

Oooooooh. I’m glad we took this detour - I knew something was off, but I was 
the one misunderstanding. Thanks for taking the time to explain it again! I was 
a bit stuck on the “well, it’s a Heisenbug then” idea but didn’t get that it 
was literally so …

> Perhaps a lot of kernel output comes out all at once, it gets flow
> controlled by the BMC, the kernel hangs waiting for printk output, and
> the watchdog then goes off.  Newer kernels have fixes to avoid this
> problem, but older ones don't.
> 
> There would be no OS crash, no SEL output, no coredump, just a watchdog
> reboot.
 
Understood. What would be a newer kernel? We’re running 5.10(.190+) at the 
moment.

The interesting part here is that we have been logging to the serial console, 
normally without anything attached, for a long, long time (think: 10 years 
plus), so there is still a bit of doubt, as this started to crop up only 
recently.
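
As a side note for anyone checking their own hosts, here is a minimal Python 
sketch that lists the console= devices on a kernel command line which request 
RTS flow control. It assumes the documented serial console option format 
BBBBPNF (speed, parity, bits, flow control, with "r" meaning RTS). The function 
name consoles_with_rts is made up for illustration and is not part of any tool 
mentioned in this thread.

```python
import re

# Assumed option format from the kernel's serial console docs: BBBBPNF,
# e.g. "115200n8r" = 115200 baud, no parity, 8 bits, RTS flow control.
_OPTS = re.compile(r"^(?P<baud>\d+)(?P<parity>[noe])?(?P<bits>\d)?(?P<flow>.)?")


def consoles_with_rts(cmdline: str) -> list[str]:
    """Return the console= devices whose options carry the 'r' flow flag."""
    flagged = []
    for token in cmdline.split():
        if not token.startswith("console="):
            continue
        # "ttyS1,115200n8r" -> device "ttyS1", options "115200n8r"
        device, _, options = token[len("console="):].partition(",")
        match = _OPTS.match(options)
        if match and match.group("flow") == "r":
            flagged.append(device)
    return flagged
```

On a running system the input string would be the contents of /proc/cmdline. 
An empty result only says that the kernel console was not asked to use RTS; 
the getty and the SOL side have to be checked separately.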

> If you turn off the SOL monitoring and the problem comes back, that
> would be a pretty good indication that something like that is happening.
> Unfortunately, it's hard to debug because you can't get info from your
> primary debugging interface.

Yeah. That’s something I’ll discuss with my team. I originally intended to turn 
off the continuous SOL monitoring but after this goose chase I’m somewhat 
willing to make it a regular thing.

> Of course, the bug may have been fixed by a kernel or app upgrade, too.
> Like you say with things like this, you may never know :).

Kernel would be the most obvious choice for us as the affected hosts are really 
only Qemu/KVM servers that didn’t see any relevant updates in the userland in 
the past months.

Thanks again,
Christian

-- 
Christian Theune · [email protected] · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
