Sometimes a machine becomes unresponsive. In my case, it is a machine with non-ECC RAM.

The reason could be that something in the hardware or kernel failed, e.g. a bit flip error [1].

In that case (for someone who is not a kernel developer) it's tough luck: the proper response is to reboot, keep statistics on that machine's failures, and replace the hardware if the crashes exceed some frequency threshold.
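A minimal sketch of that bookkeeping, assuming crash times are collected by hand or by some external monitor. The class name, the 30-day window and the limit of two crashes are hypothetical values chosen for illustration, not a recommendation:

```python
from collections import deque


class CrashLog:
    """Count unexplained crashes of one machine inside a sliding time window."""

    def __init__(self, window_seconds, max_crashes):
        self.window_seconds = window_seconds  # how far back crashes still count
        self.max_crashes = max_crashes        # tolerated crashes per window
        self.crashes = deque()                # timestamps, oldest first

    def record(self, timestamp):
        """Register a crash and forget crashes that fell out of the window."""
        self.crashes.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while self.crashes and self.crashes[0] <= cutoff:
            self.crashes.popleft()

    def should_replace_hardware(self):
        """True once the machine crashed more often than tolerated."""
        return len(self.crashes) > self.max_crashes


DAY = 86400
log = CrashLog(window_seconds=30 * DAY, max_crashes=2)
for crash_time in (0, 1 * DAY, 2 * DAY):
    log.record(crash_time)
replace_now = log.should_replace_hardware()
```

Only crashes already classified as HW/kernel failures should be fed into such a log, which is exactly why telling the two categories apart matters.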

Or, the reason could be that the user process(es) doing the work on the machine consumed all resources, in particular RAM, in such a nasty way that the machine stopped responding: those process(es) shut down, OpenSSHD shut down, and the console did not respond at all, not even to CTRL+ALT+DEL (I never tried the kernel debugger shortcut when I experienced something like this, though).

I think I've seen cases like this where it is highly unlikely that the machine was not busy swapping to disk.

The primary reason for such a failure, then, I guess, would be that memory had run so low that malloc() failures took down processes or at least froze them.

Any failure of this category would provide a good reason to debug the userland software.

The most useful thing to learn from a failed machine is which of these two categories the crash belonged to, since the appropriate responses are so different.

To help figure this out, I wanted to ask you whether there is any way to reliably trigger some logic when userland has died totally.

E.g. as a kernel thread.

I guess "died totally" could be defined as some watchdog/monitoring process in userland not having given the kernel an "I'm okay" signal for 90 seconds.
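The definition above boils down to a deadline check. Here is a rough sketch of just that logic as an ordinary user-space simulation. It is not kernel code, and the class and method names are hypothetical; in a real setup the check would have to run in the kernel or in a hardware watchdog timer so that it survives userland dying:

```python
import time


class HeartbeatWatchdog:
    """Declare userland dead when no "I'm okay" signal arrives in time."""

    def __init__(self, timeout_seconds=90.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock              # injectable so the logic can be tested
        self.last_beat = clock()        # start counting from creation

    def beat(self):
        """Called by the userland monitoring process: "I'm okay"."""
        self.last_beat = self.clock()

    def userland_dead(self):
        """True once more than timeout_seconds passed since the last beat."""
        return self.clock() - self.last_beat > self.timeout_seconds


# Simulated run with a fake clock instead of real waiting.
fake_now = [0.0]
watchdog = HeartbeatWatchdog(90.0, clock=lambda: fake_now[0])
fake_now[0] = 60.0
watchdog.beat()                         # userland still alive at t=60
fake_now[0] = 200.0                     # 140 seconds of silence
report_and_reboot = watchdog.userland_dead()
```

A monotonic clock is used by default so that adjustments to the wall clock cannot trigger or mask an expiry.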

I guess the proper response would be to print a complete report to the (serial) console (e.g. names and stacks of all running processes), sync and unmount the filesystems, and reboot, all in a way that is proof against malloc() failure.

So to sum up:

* Has anyone had problems of this kind (total system unresponsiveness after a supposed system overload due to userland misbehavior)?

* Does anyone feel that logic like this (to distinguish HW/kernel failure from userland misbehavior) would be materially relevant?

* If so, where and how do you think it would best be implemented? Is there anything in the box already that could fill this function in as foolproof a way as possible?


