How to assign some logic to handle system-gone-totally-unresponsive events (if nothing else, to let the admin track userland and hardware failures separately)

Tinker Mon, 17 Oct 2016 06:49:54 -0700

Sometimes a machine goes unresponsive. In this case, a machine with non-ECC RAM.

The reason could be that something in the hardware or kernel failed, e.g. a bit flip error [1].

In this case (for a non-kernel developer), tough luck, and the proper thing would be to reboot, keep statistics on failures on that machine, and replace the hardware should the crashes go above some frequency threshold.
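
The bookkeeping for that could be as simple as a sliding-window count per machine. A minimal sketch in Python, assuming crashes are recorded in chronological order; the class name, the 30-day window and the threshold of 3 are made up for illustration:

```python
from collections import deque


class CrashLog:
    """Sliding-window crash counter for one machine (illustrative names)."""

    def __init__(self, max_crashes=3, window_seconds=30 * 24 * 3600):
        # Both limits are assumptions; pick values that fit your hardware.
        self.max_crashes = max_crashes
        self.window_seconds = window_seconds
        self.crash_times = deque()

    def record_crash(self, timestamp):
        """Record a crash at `timestamp` (seconds, ascending order)."""
        self.crash_times.append(timestamp)

    def should_replace(self, now):
        """True when the crashes inside the window exceed the threshold."""
        # Drop crashes that have fallen out of the window.
        while self.crash_times and now - self.crash_times[0] > self.window_seconds:
            self.crash_times.popleft()
        return len(self.crash_times) > self.max_crashes
```

This only makes sense if the crashes counted are the hardware/kernel kind, which is exactly why telling the two categories apart matters.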

OR, the reason could be that the user process(es) that do the work on the machine consumed all resources, in particular RAM, in such a particularly nasty way that it stopped responding: (those) process(es) shut down, OpenSSHD shut down, and the console became unresponsive, including to CTRL+ALT+DEL (I never tried the kernel debug shortcut when I experienced something like this, though).

I think I've seen cases where this was the cause, and where it is highly unlikely that the machine was not busy swapping to disk.

The primary reason for such a failure, then, I guess, would be that memory had run so low that malloc() failures took down processes or at least froze them.

Any failure of this category would provide a good reason to debug the userland software.

The most interesting thing to know from a failed machine would be which of these two categories of crash happened, as the response is so different.

To figure this out, I wanted to ask you whether there is any way to reliably trigger some logic when userland has died totally.


E.g. as a kernel thread.

I guess "died totally" could be defined as some watchdog/monitoring process in userland not having given the kernel an "I'm okay" signal in 90 seconds.
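
That definition amounts to a deadline that every heartbeat pushes forward. A minimal sketch of the logic in Python, only to pin the definition down; the real thing would have to live in the kernel or in a hardware watchdog, and the class and method names here are hypothetical:

```python
import time


class Heartbeat:
    """Tracks the last "I'm okay" signal (hypothetical names)."""

    def __init__(self, timeout=90.0, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence tolerated
        self.clock = clock      # injectable so the logic can be tested
        self.last_ok = clock()

    def im_okay(self):
        """Called by the userland monitor while it is still alive."""
        self.last_ok = self.clock()

    def userland_dead(self):
        """True once no signal has arrived for longer than the timeout."""
        return self.clock() - self.last_ok > self.timeout
```

A monotonic clock is used so that a wall-clock adjustment cannot fake or hide a timeout.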

I guess the proper response would be to print a complete report to the (serial) console, e.g. names and stacks of all running processes, unmount and sync the filesystems, and reboot, all in a malloc()-failure-proof way.



So to sum up:

* Has anyone had problems of this kind (total system unresponsiveness after a supposed system overload due to userland misbehavior)?

* Does anyone feel that logic along these lines (to distinguish HW/kernel failure from userland misbehavior) would be materially relevant?

* If so, where and how do you think it would be best implemented? Is there anything in the box already that could fill this function in as foolproof a way as possible?



Thanks!
Tinker

[1] http://stackoverflow.com/questions/23587591/software-memory-bit-flip-detection-for-platforms-without-ecc
