Sometimes a machine becomes unresponsive. In this case, it is a machine
with non-ECC RAM.
The reason could be that something in the hardware or kernel failed,
e.g. a bit flip error [1].
In that case (for a non-kernel developer) it's tough luck: the proper
thing would be to reboot, keep statistics on failures of that machine,
and replace the hardware should the crashes exceed some frequency
threshold.
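As a rough illustration of the "keep statistics" idea, here is a minimal
Python sketch that counts crashes in a sliding time window and flags the
machine once the count exceeds a threshold. The CrashLog class, its
parameter names, and any numbers used with it are hypothetical; what
threshold is sensible depends on the hardware and the workload.

```python
from collections import deque


class CrashLog:
    """Sliding-window crash counter for one machine (hypothetical helper)."""

    def __init__(self, max_crashes, window_seconds):
        self.max_crashes = max_crashes        # tolerated crashes per window
        self.window_seconds = window_seconds  # window length in seconds
        self.events = deque()                 # crash timestamps, oldest first

    def record(self, timestamp):
        """Record a crash; return True if the threshold is now exceeded."""
        self.events.append(timestamp)
        # Forget crashes that have fallen out of the sliding window.
        while self.events and timestamp - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.max_crashes
```

For example, CrashLog(max_crashes=2, window_seconds=30 * 24 * 3600) would
flag a machine on its third crash within 30 days.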
OR, the reason could be that the user process(es) doing the work on the
machine consumed all resources, in particular RAM, in such a nasty way
that the machine stopped responding: those process(es) shut down,
OpenSSHD shut down, and the console became unresponsive, including to
CTRL+ALT+DEL (I never tried the kernel debug shortcut when I experienced
something like this, though).
I think I've seen cases like this, and it's very unlikely that the
machine was not busy swapping to disk at the time.
The primary reason for such a failure, then, I guess, would be that
memory had run so low that malloc() failures took down processes or at
least froze them.
Any failure of this category would provide a good reason to debug the
userland software.
The most interesting thing to learn from a failed machine would be which
of these two categories of crash occurred, as the responses are so
different.
To figure this out, I wanted to ask you whether there is any guaranteed
way to trigger some logic when userland has died totally, e.g. as a
kernel thread.
I guess "died totally" could be defined as some watchdog/monitoring
process in userland not having given the kernel an "I'm okay" signal for
90 seconds.
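The timeout definition above can be sketched as follows. This is only a
userland Python model of the decision logic, not a kernel
implementation; the HeartbeatMonitor class and its method names are
hypothetical, and the injectable clock exists just so the logic can be
exercised without waiting 90 seconds.

```python
import time


class HeartbeatMonitor:
    """Model of the "I'm okay" timeout check (hypothetical names)."""

    def __init__(self, timeout_seconds=90.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock            # monotonic, so wall-clock jumps don't matter
        self.last_beat = clock()      # treat start-up as the first heartbeat

    def beat(self):
        """Called by the userland watchdog process to say "I'm okay"."""
        self.last_beat = self.clock()

    def userland_dead(self):
        """True once no heartbeat has arrived for longer than the timeout."""
        return self.clock() - self.last_beat > self.timeout_seconds
```

A real version would have to run somewhere that keeps working when
userland does not, which is exactly the open question here.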
I guess the proper response would be to print a complete report to the
(serial) console, e.g. the names and stacks of all running processes,
sync and unmount the filesystems, and reboot, all in a
malloc()-failure-proof way.
So to sum up:
* Has anyone had problems of this kind (total system unresponsiveness
after a presumed system overload caused by userland misbehavior)?
* Does anyone feel that logic like this (to distinguish HW/kernel
failure from userland misbehavior) would be materially relevant?
* If so, where and how do you think it would best be implemented? Is
there already anything in the box that could fill this function in as
foolproof a way as possible?
Thanks!
Tinker
[1]
http://stackoverflow.com/questions/23587591/software-memory-bit-flip-detection-for-platforms-without-ecc