Re: How assign some logic to handle system-gone-totally-unresponsive events (if not else then to enable admin with differentiated failure tracking between userland and hardware failures)

Tinker Mon, 17 Oct 2016 19:49:36 -0700

Anton,

On 2016-10-18 09:46, [email protected] wrote:

Hi Tinker,

[..]

How to trig some event logic when the system has become vegetable
because of overload by the userland?
You're referring here to a watchdog timer, as present in some (most) BMC controllers, this usually requires an OS timer reset process, see these:

[..]

The watchdog is realised in HW with a BIOS option to enable its timeout. When the timer is not cleared by the OS process, the BMC reboots the system.
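To make the mechanism described above concrete, here is a toy model of the watchdog semantics: a countdown that fires unless something resets it in time. It is only a sketch. The class name `SoftWatchdog` and its methods are made up for illustration, nothing here talks to real hardware, and a real watchdog would reset the machine where this model merely reports expiry.

```python
import time


class SoftWatchdog:
    """Toy model of a watchdog timer: it fires unless it is reset in time.

    Hypothetical illustration only; no hardware or BMC is involved.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the model can be tested
        self._deadline = clock() + timeout_s

    def pet(self):
        """Restart the countdown, as the OS-side reset process would do."""
        self._deadline = self._clock() + self.timeout_s

    def expired(self):
        """True once the deadline has passed without a reset.

        This is the point at which real watchdog hardware would reboot.
        """
        return self._clock() >= self._deadline
```

A guard process would call `pet()` periodically. If that process is starved or frozen, the calls stop and `expired()` becomes true after `timeout_s` seconds.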

[..]

timer with a SW guard process.

This is an ARM SBC. It has no BMC and, AFAIK, no watchdog or other timer that can be programmed to cause a reboot. If you are aware of anything like that on ARM SBCs, let me know.

My limited experience here says that system overload caused by user
processes can lead to all processes dying or freezing, and to the
system going otherwise unresponsive, except that terminal input is still
echoed.
Well, what are the process limits used for then, these should help here? Then as difficult as it gets, the mission is to run the system reliably.
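As an illustration of the process limits mentioned here, the following sketch starts a child process with a cap on its data segment size (the limit that shells expose as `ulimit -d`). The helper name `run_capped` is hypothetical, and how strictly the kernel enforces `RLIMIT_DATA` differs between systems, so treat this as a starting point rather than a guarantee.

```python
import resource
import subprocess
import sys


def run_capped(argv, max_bytes):
    """Run argv in a child whose soft RLIMIT_DATA is at most max_bytes.

    Hypothetical helper for illustration. Enforcement of RLIMIT_DATA
    depends on the operating system.
    """

    def apply_limit():
        # Runs in the child after fork() and before exec().
        _, hard = resource.getrlimit(resource.RLIMIT_DATA)
        if hard == resource.RLIM_INFINITY:
            cap = max_bytes
        else:
            cap = min(max_bytes, hard)  # the soft limit may not exceed the hard one
        resource.setrlimit(resource.RLIMIT_DATA, (cap, hard))

    return subprocess.run(
        argv, preexec_fn=apply_limit, capture_output=True, text=True
    )


if __name__ == "__main__":
    # Ask the child to report the soft limit it actually runs under.
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_DATA)[0])"
    result = run_capped([sys.executable, "-c", probe], 512 * 1024 * 1024)
    print(result.stdout.strip())
```

Only the soft limit is lowered, so the child could raise it again up to the hard limit. Lowering the hard limit as well would make the cap binding for the child and everything it starts.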


This machine has little RAM, so memory is scarce and under some pressure.

Running out of RAM is closer to happening on a limited-resources machine like this, where one process may consume 50-90% of the system's RAM rather than, say, the 10% that would be more typical on server hardware. However, RAM exhaustion could happen on a server too if processes collectively use up all of it. Also, I guess there are resources other than RAM whereby userland could exhaust the system.

And for that I speculated that such event logic could be implemented as some in-kernel code e.g. as a kernel thread, if those have some kind of
higher execution guarantee than user process code,
Most probably, you are well aware of kernel level tracing and debugging.

[..]

Debugging user programs, and the kernel, is well documented in manuals.
Maybe you have some idea or proposal, that I am not able to understand.

What I was looking for is some foolproof logic for system exhaustion caused by the userland, to dump state, sync filesystems, and reboot.

Kernel tracing and debugging functionality is perhaps involved in some sense but not in the ordinary sense of being used by an admin via the console.

SoftECC (a bit-flip detection mechanism / an ECC emulator) wouldn't help with this.



If you have any thoughts about how to make that happen, feel free to share.

Anyhow, in the absence of any such logic, just doing a hardware reset is fine. It's just a bit constrained, as it comes without automated reporting & recording that could be used to distinguish hardware/kernel issues from userland issues, which encourages hardware replacement and userland software debugging beyond what's really necessary.


Tinker
