Re: How assign some logic to handle system-gone-totally-unresponsive events (if not else then to enable admin with differentiated failure tracking between userland and hardware failures)

Tinker Tue, 18 Oct 2016 04:48:08 -0700

Thanks for your remarks Anton (below).

What Anton said leads to an interesting question: what characteristics does a program have to have to be sink-proof?

This is interesting to know for the design of a "supervisory program" whose only function is to check that another program is alive - if it froze, shut it down; if it shut down, restart it - and all the while not sink itself. That's all.

Sink-proof in the sense that the likelihood is as close to zero as possible that it would terminate, or that its execution would otherwise stop, because the system is out of memory or descriptors, has a fairly jammed kernel, whatever - if even swapping of the binary and heap from resident RAM to disk could be prevented, that would be useful too.



Any code examples or principles available?

Wild guesses: Keep it minimal and to only the absolute basics, e.g. printf()/fprintf()/(f?)write()/(f?)read()/select() + fsync(), limited to output to console or serial IO only and to read from a 'watchdog' pipe only, kill() + wait(), fork() + exec()/execve(), sysctl() to check and report free memory if relevant, only utilize buffers on the stack or allocated on process start, and don't do any malloc(), and.. run the supervisory program with lower niceness than the supervised program, and, run it as root??

Finally, the signal is preferably SIGQUIT, as http://man.openbsd.org/sigaction.2 says that should produce a core dump. Yey!
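
A minimal sketch in C of what such a supervisor could look like, following the guesses above. Everything named here is hypothetical and only for illustration: the functions spawn_child(), wait_heartbeat(), stop_child() and supervise(), the convention that the supervised program writes at least one byte to descriptor 3 per timeout period, and the timeout values. The mlockall() call is an addition not in the list above; it asks the kernel to keep the supervisor's pages resident and may fail without sufficient privilege or locked-memory limit. Whether SIGQUIT really leaves a core file also depends on the core size limit of the supervised process.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

#define HEARTBEAT_FD      3   /* assumption: child writes heartbeats here */
#define HEARTBEAT_TIMEOUT 60  /* assumption: seconds of silence = frozen */
#define CORE_DUMP_GRACE   10  /* assumption: seconds allowed for SIGQUIT */

/* fork() + execv() the program; the parent keeps the pipe's read end. */
pid_t spawn_child(char *const argv[], int *hb_read_fd)
{
    int p[2];
    pid_t pid;

    if (pipe(p) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {                      /* child */
        close(p[0]);
        if (p[1] != HEARTBEAT_FD) {
            if (dup2(p[1], HEARTBEAT_FD) == -1)
                _exit(127);
            close(p[1]);
        }
        execv(argv[0], argv);
        _exit(127);                      /* exec failed */
    }
    close(p[1]);                         /* parent */
    *hb_read_fd = p[0];
    return pid;
}

/* 1 = heartbeat seen, 0 = timeout, -1 = pipe closed or error. */
int wait_heartbeat(int fd, int timeout_sec)
{
    char buf[64];                        /* stack buffer, no malloc() */
    fd_set rfds;
    struct timeval tv;
    int r;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    r = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (r == 0)
        return 0;
    if (r == -1)
        return -1;
    return read(fd, buf, sizeof buf) > 0 ? 1 : -1;
}

/* SIGQUIT first (default action dumps core), SIGKILL if that is ignored. */
void stop_child(pid_t pid)
{
    int i;

    kill(pid, SIGQUIT);
    for (i = 0; i < CORE_DUMP_GRACE; i++) {
        if (waitpid(pid, NULL, WNOHANG) == pid)
            return;
        sleep(1);
    }
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
}

/* Restart the program forever; never returns. */
void supervise(char *const argv[])
{
    (void)mlockall(MCL_CURRENT | MCL_FUTURE);   /* best effort */
    for (;;) {
        int fd;
        pid_t pid = spawn_child(argv, &fd);

        if (pid == -1) {                 /* out of processes/descriptors */
            sleep(1);
            continue;
        }
        while (wait_heartbeat(fd, HEARTBEAT_TIMEOUT) == 1)
            ;                            /* alive, keep waiting */
        stop_child(pid);                 /* frozen or already exited */
        close(fd);
    }
}
```

A main() would only need to call supervise(argv + 1) with the full path of the supervised program as its first argument. In this sketch a closed pipe is treated the same as a missed heartbeat, so a program that closes descriptor 3 gets restarted even if it is otherwise healthy.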


(Best thing obviously is to run software that works.)

Tinker

On 2016-10-18 13:38, [email protected] wrote:

Tue, 18 Oct 2016 12:40:10 +0800 Tinker <[email protected]>

Anton,

Thanks for your remarks and clarifications,

Sorry if the question did not appear perfectly clear from the beginning.
Reset by HW watchdog would not dump state. (Thanks for pointing out that
it exists though, wasn't aware.)


Hi Tinker,

In this case, you most probably need to make sure you go through a graceful (non crash / panic) OS halt that keeps the required level of state.

I see the rationality in your suggestion that BSD/Unix is a thinner
abstraction than as to contain userland failure handling logics (i.e.
userland is presumed to work).

If processes die from a program error, or get killed because they exceed the allowed resource allocation, a dedicated process monitors them and, in the most common case, optionally restarts / respawns them - known as a system supervisor program, see:
https://en.wikipedia.org/wiki/Supervisory_program
This is a common problem, usually also solved for system services / daemons via privilege separation, where a minimal parent process runs in the background as a high privileged program, and a child / separate process runs at user level as the program that can get abused or suffer unexpected operating conditions.
What you're probably thinking of is a form of system monitor, which in a UNIX like system is typically realised as resource limits and supervisors.
Here is another article:
https://en.wikipedia.org/wiki/Process_supervision

Also I agree the best thing is that userland never breaks the system.
This might be realistic. I had some experiences with machines that died
totally because of userland, that's why I raised this topic at all.

Yes, I know what you mean; however, it is not in the design of the OS to factor in incorrect / poor programs. They're supposed to hit a hard limit and terminate / die suddenly with an explicit error. Then, depending on the software stack arrangement (which depends mostly on the skills of the op or dev), the system can continue running as expected with re-spawned processes or some other process table state. Further, you can devise a special monitor of the system's running parameters and make automated decisions / system calls.

In the presence of some occasional userland crashes, I still think there
is relevance in the idea of a kernel-level "watchdog" that dumps state
and reboots at timeout.

Probably, and most probably such a mechanism may exist at the kernel level around the panic / kernel debugger code, you have to ask and look further.

I'm in a place where I'm running a piece of inhouse software that can be
heavy. Using the HW watchdog would not help me distinguish userland vs
kernel issue. Implementing own "I'm alive" reporting from userland to
the network would however, though, such a solution would not get the
dump which would inform exactly where the actual halt happened.

You MUST have some resource limits, or another mechanism to guard against runaway processes.
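
A minimal sketch in C of such limits, assuming they are applied in the child between fork() and exec() so that only the supervised program is constrained. The helper names cap_limit() and apply_child_limits() and the concrete values (256 MiB of data segment, 128 descriptors) are hypothetical and would need tuning for the actual machine.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Lower the soft and hard limit of one resource to at most max.
 * Limits that are already lower are left alone. Returns 0 on success. */
int cap_limit(int resource, rlim_t max)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) == -1)
        return -1;
    if (rl.rlim_max == RLIM_INFINITY || rl.rlim_max > max)
        rl.rlim_max = max;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;
    return setrlimit(resource, &rl);
}

/* Hypothetical policy: cap data segment size and open descriptors. */
int apply_child_limits(void)
{
    if (cap_limit(RLIMIT_DATA, (rlim_t)256 * 1024 * 1024) == -1)
        return -1;
    if (cap_limit(RLIMIT_NOFILE, 128) == -1)
        return -1;
    return 0;
}
```

With limits like these a runaway allocation fails inside the offending process (malloc() returns NULL or the process is terminated) rather than starving the whole system. Lowering a hard limit cannot be undone by an unprivileged process, which is why this belongs in the child and not in the supervisor itself.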

So basically just a kernel patch to do the "ps", "trace /u", "boot
reboot" ddb(4) commands, when "echo >> /dev/kernelwatchdog" has not happened
for 60 secs.

You may be overly simplifying this. I know what you meant, yet the SoftECC concepts were misleading, and I was interested in whether somebody would discuss them.


Kind regards,
Anton

Anyhow, very sorry that you felt this bothered you - Have a good day!!
Tinker

On 2016-10-18 11:52, [email protected] wrote:
> Tue, 18 Oct 2016 10:47:51 +0800 Tinker <[email protected]>
>> Anton,
>>
>> On 2016-10-18 09:46, [email protected] wrote:
>> > Hi Tinker,
>> [..]
>> >> How to trigger some event logic when the system has become a vegetable
>> >> because of overload by the userland?
>> >
>> > You're referring here to a watchdog timer, as present in some (most)
>> > BMC
>> > controllers, this usually requires an OS timer reset process, see
>> > these:
>> [..]
>> > The watchdog is realised in HW with a BIOS option to enable its
>> > timeout.
>> > When timer is not cleared by the OS process, the BMC reboots the
>> > system.
>> [..]
>> > timer with a SW guard process.
>>
>> This is an ARM SBC, it has no BMC and AFAIK no watchdog or other timer
>> that can be programmed to cause a reboot, if you are aware of anything
>> like that on ARM SBC:s let me know?
>
> Hi Tinker,
>
> Do you realise you just performed sudden thread mutation, and now all
> your
> previous posts seem totally misleading?  Why did you hide the arm
> platform
> facts, and any hardware related detail up to this point?  Did you take
> any
> care to do some research before posting this?  All embedded soc
> platforms,
> all of them dating back to the dawn of embedded controllers have some
> form
> of timers, watchdogs, counters, reset functionality by design, look it
> up.
> Your opening link to the FakeECC noises on smackexchange is totally
> fubar.
> These have nothing to do with arm platforms, I think you picked your
> nose.
>
>> >> My limited experience here says that system overload caused by user
>> >> processes can lead to that all processes die or freeze, and that the
>> >> system goes otherwise unresponsive, except for that terminal input
>> >> still
>> >> is echoed.
>> >
>> > Well, what are the process limits used for then, these should help
>> > here?
>> > Then as difficult as it gets, the mission is to run the system
>> > reliably.
>>
>> Because of limited RAM, RAM is scarce and under some pressure.
>>
>> Running out of RAM is closer to happening on a limited-resources
>> machine
>> like this where one process may rather consume 50-90% of the system's
>> RAM than say 10% which would be more typical on server hardware.
>> However
>> RAM exhaustion could happen on a server also if processes collectively
>> use up all of it. Also I guess there are resources other than RAM
>> whereby userland could exhaust the system.
>
> This is complete nonsense, I'll skip the copy to the public list,
> because
> it may look like I am attacking your posts, while I am not.  However
> it's
> obvious you are making ridiculous twists in your thread posts, you're
> not
> making any sense at all.  The process limits, overall resource
> allocation
> and on top of this careful program design takes care of this.  You
> ignore
> this mechanism, and want some kernel land magic thing to profile or
> watch
> over your user land processes, this is utterly ignorant.  If you drop
> the
> limits and allow some process to hog everything, you are purposely
> making
> your system resource constrained and nothing ring0 will help you, you
> are
> at this point timing out & the watchdog reboots the system, or you
> crash.
>
> Also, you are making one hugely blatant mistake:  you are trying to use
> a
> system designed for one mode of operation for a different one, and
> invent
> some sudden game changer just to continue asking for a salvation from
> the
> kernel, where you try to run with scissors up and down ladder, on
> stairs.
>
>> >> And for that I speculated that such event logic could be implemented
>> >> as
>> >> some in-kernel code e.g. as a kernel thread, if those have some kind
>> >> of
>> >> higher execution guarantee than user process code,
>> >
>> > Most probably, you are well aware of kernel level tracing and
>> > debugging.
>> [..]
>> > Debugging user programs, and the kernel, is well documented in manuals.
>> > Maybe you have some idea or proposal, that I am not able to understand.
>>
>> What I was looking for is some foolproof logic for system exhaustion
>> caused by the userland, to dump state, sync filesystems, and reboot.
>
> If you try to invent a completely foolproof system only fools can use
> it.
> I have no time to talk nonsense any more please hear from others,
> Tinker.
>
> Kind regards,
> Anton
>
>> Kernel tracing and debugging functionality is perhaps involved in some
>> sense but not in the ordinary sense of being used by an admin via the
>> console.
>>
>> SoftECC (a bit-flip detection mechanism / an ECC emulator) wouldn't
>> help
>> this.
>>
>>
>> If you have any thought about how make that happen feel free to share.
>>
>> Anyhow in the absence of any such logic, just doing a hardware reset
>> is
>
> P.S. Whatever, please don't waste more time on this, you messed this
> up.
>
>> fine, it's just a bit constrained as it comes without automated
>> reporting&recording that could be used to distinguish hardware/kernel
>> issues from userland issues, which encourages hardware replacement and
>> userland software debugging beyond what's really necessary.
>>
>> Tinker
