Anton,

On 2016-10-18 09:46, li...@wrant.com wrote:
Hi Tinker,
[..]
How to trigger some event logic when the system has become a vegetable
because of overload by the userland?


You're referring here to a watchdog timer, as present in some (most) BMC controllers. This usually requires an OS process that periodically resets the timer; see these:
[..]
The watchdog is realised in HW, with a BIOS option to enable its timeout. When the timer is not cleared by the OS process, the BMC reboots the system.
[..]
timer with a SW guard process.

This is an ARM SBC. It has no BMC and, AFAIK, no watchdog or other timer that can be programmed to cause a reboot. If you are aware of anything like that on ARM SBCs, let me know.

My limited experience here is that system overload caused by user
processes can lead to all processes dying or freezing, and to the
system otherwise going unresponsive, except that terminal input is
still echoed.

Well, what are the process limits for, then? They should help here. However difficult it gets, the mission is to run the system reliably.
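For illustration, a minimal C sketch of one such per-process limit, assuming a POSIX system with getrlimit()/setrlimit(): a wrapper lowers its own soft RLIMIT_DATA and then execs the workload, which inherits the cap, so the workload's allocations fail instead of pushing the whole machine into memory exhaustion. The helper name cap_data_limit is hypothetical, and exactly which allocations RLIMIT_DATA covers differs between kernels and versions, so check setrlimit(2) on the target system.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: lower the calling process's soft RLIMIT_DATA to
 * `bytes`. The limit is inherited across fork() and preserved across
 * execve(), so a wrapper can call this and then exec the real workload.
 * Returns 0 on success, -1 on error (errno set by getrlimit/setrlimit).
 */
int cap_data_limit(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_DATA, &rl) == -1)
        return -1;

    /* The soft limit may not exceed the hard limit, so clamp to it. */
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;

    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_DATA, &rl);
}
```

A wrapper would call cap_data_limit() with a budget chosen for the board and then execvp() the real program; the wrapper itself stays untouched by the workload's memory use.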

Because the machine has limited RAM, memory is scarce and under some pressure.

Running out of RAM is more likely on a resource-limited machine like this, where one process may consume 50-90% of the system's RAM rather than, say, the 10% that would be more typical on server hardware. However, RAM exhaustion could also happen on a server if processes collectively use all of it up. I'd also guess there are resources other than RAM through which userland could exhaust the system.

And for that I speculated that such event logic could be implemented as in-kernel code, e.g. as a kernel thread, if those have some kind of
higher execution guarantee than user process code,

Most probably, you are well aware of kernel-level tracing and debugging.
[..]
Debugging user programs, and the kernel, is well documented in manuals.
Maybe you have some idea or proposal that I am not able to understand.

What I was looking for is some foolproof logic that, on system exhaustion caused by the userland, would dump state, sync filesystems, and reboot.
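As a rough userland approximation of the detection half only (not the foolproof in-kernel mechanism asked about here), a guard process can measure how late it wakes up from a fixed sleep: under heavy overload its wakeups slip, and a large lateness could be the trigger to dump state, sync, and reboot. This works only while the guard itself still gets scheduled, for example if it runs at raised priority. The function name and any threshold are hypothetical; the sketch assumes POSIX nanosleep() and clock_gettime(CLOCK_MONOTONIC).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/*
 * Hypothetical helper: sleep for `period_ms` milliseconds and return how
 * many milliseconds later than requested the process actually woke up
 * (never negative). Returns -1 if a clock or sleep call fails.
 */
long wakeup_lateness_ms(long period_ms)
{
    struct timespec start, end, req, rem;

    if (clock_gettime(CLOCK_MONOTONIC, &start) == -1)
        return -1;

    req.tv_sec = period_ms / 1000;
    req.tv_nsec = (period_ms % 1000) * 1000000L;

    /* Resume the remaining sleep if a signal interrupts it. */
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) == -1)
        return -1;

    long elapsed_ms = (long)(end.tv_sec - start.tv_sec) * 1000L
                    + (end.tv_nsec - start.tv_nsec) / 1000000L;
    long late = elapsed_ms - period_ms;

    /* Millisecond truncation can make this slightly negative; clamp. */
    return late > 0 ? late : 0;
}
```

A guard loop would call this repeatedly and act only after the lateness exceeds some threshold several times in a row, to avoid rebooting on a single hiccup.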

Kernel tracing and debugging functionality is perhaps involved in some sense, but not in the ordinary sense of being used by an admin via the console.

SoftECC (a bit-flip detection mechanism / an ECC emulator) wouldn't help with this.


If you have any thoughts about how to make that happen, feel free to share.

Anyhow, in the absence of any such logic, just doing a hardware reset is fine. It's just a bit limiting, as it comes without automated reporting and recording that could be used to distinguish hardware/kernel issues from userland issues; that in turn encourages hardware replacement and userland software debugging beyond what's really necessary.

Tinker
