Tue, 18 Oct 2016 08:43:39 +0800 Tinker <ti...@openmailbox.org>
> Hi Anton,
> You misread me -
This was intentional, as SoftECC is what you'd relate to, in contrast to
the hardware ECC (SECDED), see: https://en.wikipedia.org/wiki/ECC_memory
Below the references on that article page there is an external link titled
SoftECC: A System for Software Memory Integrity Checking
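To make the contrast with hardware ECC concrete, here is a minimal sketch of software integrity checking in general. It is not SoftECC's actual implementation; the function names and the 4096-byte block size are assumptions made for illustration. It only detects corruption and cannot correct it, unlike SECDED hardware.

```python
import zlib


def checksum_blocks(buf, block_size=4096):
    """Return one CRC32 per fixed-size block of buf (last block may be short)."""
    view = memoryview(buf)
    return [
        zlib.crc32(view[i:i + block_size])
        for i in range(0, len(view), block_size)
    ]


def find_corrupt_blocks(buf, expected, block_size=4096):
    """Return the indices of blocks whose CRC32 no longer matches `expected`."""
    current = checksum_blocks(buf, block_size)
    return [i for i, (now, then) in enumerate(zip(current, expected)) if now != then]
```

The checksums are recorded while the data is known to be good and compared again later; a flipped bit shows up as a mismatching block index.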
> What I queried for was not how to trig some event logic on bit flip
> errors (because on a non-ECC machine those will generally appear as data
> corruption or undefined behavior only) or other hardware or kernel
> error, but:
> How to trig some event logic when the system has become vegetable
> because of overload by the userland?
You're referring to a watchdog timer, as found in some (most) BMCs. It
usually requires an OS process that resets the timer; see these:
> My limited experience here says that system overload caused by user
> processes can lead to that all processes die or freeze, and that the
> system goes otherwise unresponsive, except for that terminal input still
> is echoed.
Well, what are process limits for, then? They should help here.
However difficult it gets, the mission is to run the system reliably.
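As a rough illustration of process limits, the sketch below starts a child process with lowered soft resource limits. The helper names run_limited and clamp_soft are hypothetical; the calls themselves come from Python's standard resource and subprocess modules on Unix-like systems. Which limits actually keep a given system responsive is something you would have to find out by testing.

```python
import resource
import subprocess


def clamp_soft(requested, hard):
    """A soft limit may not exceed the hard limit, unless that is unlimited."""
    if hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)


def run_limited(argv, soft_limits):
    """Run argv in a child with the given soft limits; return CompletedProcess.

    soft_limits maps resource constants (e.g. resource.RLIMIT_CPU) to the
    requested soft value. Hard limits are left unchanged.
    """
    def apply_limits():
        # Runs in the child, after fork and before exec.
        for res, requested in soft_limits.items():
            _, hard = resource.getrlimit(res)
            resource.setrlimit(res, (clamp_soft(requested, hard), hard))

    return subprocess.run(argv, preexec_fn=apply_limits,
                          capture_output=True, text=True)
```

For example, run_limited(["some-job"], {resource.RLIMIT_CPU: 60}) would cap a hypothetical "some-job" command at 60 seconds of CPU time before the kernel signals it.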
> And for that I speculated that such event logic could be implemented as
> some in-kernel code e.g. as a kernel thread, if those have some kind of
> higher execution guarantee than user process code,
Most probably, you are well aware of kernel level tracing and debugging.
> E.g., when a userland watchdog/monitoring process didn't send any "I'm
> OK" signal to that thread for 60 seconds, that thread would dump the
> system's state to the console and reboot the machine.
The watchdog is realised in HW with a BIOS option to enable its timeout.
When the timer is not cleared by the OS process, the BMC reboots the system.
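The timeout logic can be modelled in a few lines. This is only a toy model, assuming a 60-second timeout as in your example; the class and method names are made up, and a real watchdog runs in hardware or in the kernel, not in a script.

```python
import time


class WatchdogModel:
    """Toy model of a watchdog: fires once unless it is patted in time."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_expire = on_expire      # stands in for "BMC reboots the system"
        self.clock = clock              # injectable so the model can be tested
        self.deadline = clock() + timeout_s
        self.expired = False

    def pat(self):
        """Heartbeat from the guard process: push the deadline out again."""
        if not self.expired:
            self.deadline = self.clock() + self.timeout_s

    def poll(self):
        """Timer side: fire on_expire at most once after the deadline passes."""
        if not self.expired and self.clock() >= self.deadline:
            self.expired = True
            self.on_expire()
        return self.expired
```

The guard process calls pat() on a schedule well inside the timeout. If userland is too overloaded to run it, the deadline passes and the expiry action fires, which is the behaviour described above.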
> This way I'd be able to distinguish userland-caused system crashes from
> hardware/kernel crashes, as the former always make that output and
> reboot, whereas the latter don't (but instead reboot, crash to kernel
> debug console, or just freeze the system altogether).
Debugging user programs, and the kernel, is well documented in manuals.
Maybe you have some idea or proposal, that I am not able to understand.
> Do you see where I was heading now?
Let's hope there are some pointers here; if not, please expand on your concept.
Both HW ECC & watchdog timer are available on server class main boards,
& work as advertised: ECC transparently, timer with a SW guard process.
What I want to know, however, is how OpenBSD handles ECC, if at all.