I tend to disagree that the language matters all that much, but at the
same time it can.  I have worked on many real-time OSes, many home-brew
and some WindRiver stuff, all on critical systems: test equipment for
nuclear reactors, airplanes and such.  Linux on zOS is new to me, but
Cole Software, my current employer, uses the VMS error recovery
facilities as the core of our debugger.

From years of working on systems where failure is NOT trivial, there are
several things that I learned:
1. Use the language that gives you the best access to the parts of the
system you need, when you need them, while staying as high level as
possible.  Thus if it takes assembler to access error recovery, use it
for those parts ONLY!  If you can register C or C++ code for error
recovery, then do that.
2. Use as 'safe' a language as possible for as much of the system as
possible, where 'safe' implies high level and easy to verify.  (On many
jobs C with HEAVY use of lint was the best; a good lint tool catches
MANY C/C++ memory errors.)
3. TRAP ALL ERRORS!  Hook into every OS hook available.  If rolling your
own OS, provide the best hooks possible.  (A rough C sketch of points
3-5 follows this list.)
4. Identify 'safe' recovery points: points that can be 'jumped' into for
recovery.  So if you get a memory fault in some routine X, know that
restarting at routine Z will most likely give the best recovery with the
least loss.  (If Z fails, then know what the next best recovery point
is, with a total restart being the point of last resort.)
5. Activity log!!!  If errors, oddities, or unlikely things happen,
write a log entry so the problem can be reassessed later and so the
operator knows to question their data.  (As a secondary issue, design
debug code so it STAYS in for production operation.  Basic NRC and other
critical-system rule: "test *the* code that will run."  Don't test, then
recompile with new switches and ship it!!!  Very bad!)
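
To make points 3-5 concrete, here is a minimal C sketch, assuming a
POSIX-style system: register a handler for SIGSEGV with sigaction(),
siglongjmp() back to a known-good recovery point when a fault is
trapped, and note the event in an activity log.  The names
(recovery_point, activity_log, fault_handler) are made up for
illustration; a production system would also use sigaltstack() and save
far more state.

    /* Minimal sketch of points 3-5: trap a memory fault, jump back to a
     * 'safe' recovery point, and log the event.  Illustration only. */
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <time.h>

    static sigjmp_buf recovery_point;          /* the 'safe' restart point */

    static void activity_log(const char *msg)  /* point 5: always log */
    {
        FILE *log = fopen("activity.log", "a");
        if (log != NULL) {
            fprintf(log, "%ld: %s\n", (long)time(NULL), msg);
            fclose(log);
        }
    }

    static void fault_handler(int sig)
    {
        /* Only async-signal-safe work belongs here in production code;
         * the jump back to the recovery point is the essential part. */
        (void)sig;
        siglongjmp(recovery_point, 1);
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = fault_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGSEGV, &sa, NULL);         /* point 3: hook the OS trap */

        if (sigsetjmp(recovery_point, 1) != 0) {
            /* We got here via the handler: log it and carry on degraded. */
            activity_log("memory fault trapped, restarted at recovery point");
        }

        /* ... application work that might fault goes here ... */
        return 0;
    }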

Basically, no silver bullets; just do a good job of design, test and
development.  (Design and develop to your tests; test design should come
*before* code design.)

A bunch of years back I saw a university student paper about
"self-healing" systems, where an OS would trap things like memory faults
and then use a whole lot of rule-based Artificial Intelligence to track
back through the application code to try to figure out what it 'meant'
to do.  It was an interesting read, but considering how much more I have
seen about it since (ZERO), I guess it never got worked out quite right 8-)


At 08:48 AM 2003_10_29, you wrote:
> Recovery is only as good as the language framework allows it to be.
> Compilers insulate you from the data and the hardware, and reduce your
> level of control over how errors are handled.
> But that's part of what you're buying by using a compiler in the first
> place:  Not to have to worry about all those "little details".

I don't disagree that having the compiler worry about some things is OK;
it's more the assumption that "oh, this is Java so I don't have to do return
code checking or worry about that 10G malloc() call on a small machine" that
gets me miffed.  No matter what language you use, there's still some basic
sanity checking that has to be done to ensure stability, and it's getting
rarer and rarer.
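
For instance, the kind of check being skipped is a few lines of C (a
throwaway sketch, not anyone's actual code):

    /* Never assume a large allocation succeeded, whatever the language
     * hides from you. */
    #include <stdio.h>
    #include <stdlib.h>

    void *checked_alloc(size_t nbytes)
    {
        void *p = malloc(nbytes);
        if (p == NULL) {
            fprintf(stderr, "allocation of %zu bytes failed\n", nbytes);
            /* recover, degrade, or bail out deliberately -- don't just crash */
        }
        return p;
    }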

>  An infinitely smart programmer could conceivably write enough code to
> fix or recover from ANY failure, but how many of THOSE are there?

Nobody's asking for perfection, but stupid little stuff like not checking
arguments for sanity or indexing off the end of a string because you're too
lazy to check the length before you increment is just lousy style.
Eradicating code like that is a moral imperative.
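
The fix costs a single comparison; a throwaway C sketch of the kind of
check being described:

    /* Know the length before you walk off the end of the buffer. */
    #include <stddef.h>

    int next_char(const char *buf, size_t len, size_t *i)
    {
        if (buf == NULL || *i + 1 >= len)   /* argument and bounds sanity */
            return -1;                      /* caller decides how to recover */
        (*i)++;
        return (unsigned char)buf[*i];
    }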

> And who writes in assembler anymore anyway?

Still a fair amount of it for those of us doing embedded work. :caveman. Me
Make Hardware Go. Ugh. :ecaveman.

-- db

Dale Strickler
Cole Software, LLC
Voice: 540-456-8896
Fax: 540-456-6658
Web: http://www.colesoft.com/
