Tue, 18 Oct 2016 08:43:39 +0800 Tinker <ti...@openmailbox.org>
> Hi Anton,
>
> You misread me -

Hi Tinker,

This was intentional, as SoftECC is what you'd relate to, in contrast to
the hardware ECC (SECDED), see:

https://en.wikipedia.org/wiki/ECC_memory

Below the references on that page there is an external link titled
"SoftECC: A System for Software Memory Integrity Checking":

http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf

> What I queried for was not how to trig some event logic on bit flip
> errors (because on a non-ECC machine those will generally appear as data
> corruption or undefined behavior only) or other hardware or kernel
> error, but:
>
> How to trig some event logic when the system has become vegetable
> because of overload by the userland?

You're referring here to a watchdog timer, as present in some (most)
BMCs. It usually requires an OS process that resets the timer, see:

https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface
https://en.wikipedia.org/wiki/Watchdog_timer

> My limited experience here says that system overload caused by user
> processes can lead to that all processes die or freeze, and that the
> system goes otherwise unresponsive, except for that terminal input still
> is echoed.

Well, what are the process limits for, then? They should help here.
However difficult it gets, the mission is to run the system reliably.

> And for that I speculated that such event logic could be implemented as
> some in-kernel code e.g. as a kernel thread, if those have some kind of
> higher execution guarantee than user process code,

Most probably, you are well aware of kernel level tracing and debugging.

> E.g., when a userland watchdog/monitoring process didn't send any "I'm
> OK" signal to that thread for 60 seconds, that thread would dump the
> system's state to the console and reboot the machine.

The watchdog is realised in HW, with a BIOS option to enable its timeout.
When the timer is not cleared by the OS process, the BMC reboots the
system.

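To make the timeout logic concrete, here is a minimal sketch in Python of
the "no I'm-OK signal for 60 seconds" rule discussed above. It is only a
userland simulation under stated assumptions: the names SoftwareWatchdog,
kick, expired and supervise are made up for illustration, and this is not
the BMC's interface nor any OpenBSD kernel API. On real hardware the
countdown runs in the BMC, independently of the OS.

```python
import time


class SoftwareWatchdog:
    """Toy watchdog (hypothetical): expires unless kicked within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock              # injectable, so the logic can be tested
        self.last_kick = clock()

    def kick(self):
        """Record the "I'm OK" signal from the monitoring process."""
        self.last_kick = self.clock()

    def expired(self):
        """Return True once no kick has arrived for `timeout` seconds."""
        return self.clock() - self.last_kick >= self.timeout


def supervise(watchdog, on_expire, poll_interval=1.0, sleep=time.sleep):
    """Poll until the watchdog expires, then run the expiry action once."""
    while not watchdog.expired():
        sleep(poll_interval)
    # Stand-in for "dump the system state and reboot".
    return on_expire()
```

A monitoring process would call kick() periodically; if it stalls for
longer than the timeout, supervise() runs the expiry action. A supervisor
written like this shares the fate of an overloaded userland, which is why
the countdown is normally left to the hardware.
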
> This way I'd be able to distinguish userland-caused system crashes from
> hardware/kernel crashes, as the further always make that output and
> reboot, whereas the latter don't (but instead reboot, crash to kernel
> debug console, or just freeze the system altogether).

Debugging user programs, and the kernel, is well documented in the
manuals. Maybe you have some idea or proposal that I am not able to
understand.

> Do you see where I was heading now?
>
> Tinker
>

Let's hope there are some pointers here, or that you expand further on
your concepts.

Both HW ECC & watchdog timer are available on server class main boards,
& work as advertised: ECC transparently, timer with a SW guard process.

What I want to know, however, is how OpenBSD handles ECC, if at all?

Kind regards,
Anton