Re: How assign some logic to handle system-gone-totally-unresponsive events (if not else then to enable admin with differentiated failure tracking between userland and hardware failures)

lists Mon, 17 Oct 2016 18:47:56 -0700

Tue, 18 Oct 2016 08:43:39 +0800 Tinker <ti...@openmailbox.org>
> Hi Anton,
> 
> You misread me -


Hi Tinker,

This was intentional, as SoftECC is what you'd relate to, in contrast to
the hardware ECC (SECDED), see: https://en.wikipedia.org/wiki/ECC_memory

Below the references on that article page there is an external link titled

SoftECC: A System for Software Memory Integrity Checking
http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf

> What I asked about was not how to trigger some event logic on bit flip 
> errors (because on a non-ECC machine those will generally appear as data 
> corruption or undefined behavior only) or on other hardware or kernel 
> errors, but:
> 
> How to trigger some event logic when the system has become a vegetable 
> because of overload by the userland?

What you describe here is a watchdog timer, as present in some (most)
BMCs. It usually requires an OS process that resets the timer; see:

https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface
https://en.wikipedia.org/wiki/Watchdog_timer

> My limited experience here says that system overload caused by user 
> processes can lead to all processes dying or freezing, and to the 
> system otherwise going unresponsive, except that terminal input is 
> still echoed.

Well, what are process limits for, then? They should help here.
However difficult it gets, the mission is to run the system reliably.
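
To illustrate the process-limits point, here is a minimal sketch (in
Python, purely for illustration) of capping a resource limit before
starting a potentially runaway job. The helper name cap_soft_limit is
hypothetical; resource.getrlimit and resource.setrlimit are the standard
Python wrappers around getrlimit(2)/setrlimit(2). Which limits are
appropriate for a given workload is an assumption you would have to
check yourself.

```python
import resource


def cap_soft_limit(which, ceiling):
    """Cap the soft limit of resource `which` at `ceiling`.

    Hypothetical helper: it never raises the current soft limit and
    never exceeds the hard limit. Returns the (soft, hard) pair that is
    in effect afterwards.
    """
    soft, hard = resource.getrlimit(which)
    # RLIM_INFINITY is a sentinel, so exclude it before taking min().
    candidates = [ceiling] + [
        v for v in (soft, hard) if v != resource.RLIM_INFINITY
    ]
    resource.setrlimit(which, (min(candidates), hard))
    return resource.getrlimit(which)


# Example use (not executed here): limit this process and its children
# to at most 64 processes for the invoking user.
#     cap_soft_limit(resource.RLIMIT_NPROC, 64)
```

The limit is inherited by child processes, so a wrapper like this could
be called right before exec'ing the workload.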

> And for that I speculated that such event logic could be implemented as 
> some in-kernel code e.g. as a kernel thread, if those have some kind of 
> higher execution guarantee than user process code,

Most probably, you are well aware of kernel-level tracing and debugging.

> E.g., when a userland watchdog/monitoring process didn't send any "I'm 
> OK" signal to that thread for 60 seconds, that thread would dump the 
> system's state to the console and reboot the machine.

The watchdog is realised in HW, with a BIOS option to enable its timeout.
When the timer is not cleared by the OS process, the BMC reboots the
system.
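
The reset-or-reboot logic can be sketched as a toy model (in Python, for
illustration only). This is not how a BMC or any particular kernel
implements it; the class SoftwareWatchdog, the function supervise and
the on_expire callback are hypothetical names. The 60 second timeout in
the usage comment is taken from your example.

```python
import time


class SoftwareWatchdog:
    """Toy watchdog: expires unless kicked within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the model can be tested
        self.deadline = clock() + timeout

    def kick(self):
        """The "I'm OK" signal: push the deadline forward."""
        self.deadline = self.clock() + self.timeout

    def expired(self):
        """True once the deadline has passed without a kick."""
        return self.clock() >= self.deadline


def supervise(watchdog, on_expire):
    """Call once per polling tick; runs on_expire if the timer ran out.

    Returns True when the expiry action was triggered.
    """
    if watchdog.expired():
        on_expire()  # e.g. dump state and reboot, in a real system
        return True
    return False


# Example use (not executed here):
#     wd = SoftwareWatchdog(60)
#     ... guard process calls wd.kick() periodically ...
#     ... supervisor calls supervise(wd, dump_and_reboot) on each tick ...
```

In the hardware case the supervising side is the BMC, which keeps
running even when the OS no longer schedules anything.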

> This way I'd be able to distinguish userland-caused system crashes from 
> hardware/kernel crashes, as the former always make that output and 
> reboot, whereas the latter don't (but instead reboot, crash to kernel 
> debug console, or just freeze the system altogether).

Debugging user programs, and the kernel, is well documented in manuals.
Maybe you have some idea or proposal, that I am not able to understand.

> Do you see where I was heading now?
> 
> Tinker
> 

Let's hope there are some pointers here, or that you expand on your
concept further. Both HW ECC & watchdog timer are available on server
class main boards, & work as advertised: ECC transparently, timer with
a SW guard process. What I want to know, however, is how OpenBSD
handles ECC, if at all?

Kind regards,
Anton
