I tend to disagree with the language being of a lot of importance. But at the same time it can be. I have worked on many real-time OSes, many home-brew and some WindRiver stuff. All on critical systems: test equipment for nuclear reactors, airplanes and such. The Linux on zOS is new to me, but Cole Software, my current employer, uses the VMS error recovery as the core of our debugger.
From years of working on systems where failure is NOT trivial, there are several things that I learned:

1. Use the language that gives you the best access to the parts of the system you need, when you need it, while staying as high level as possible. Thus if it takes assembler to access error recovery, use it for those parts ONLY! If you can register C code or C++ code for error recovery, then use that.

2. Use as 'safe' a language as possible for as much of the system as possible, where 'safe' implies high level and easy to verify. (On many jobs C with HEAVY use of Lint was the best. A good lint tool catches MANY C/C++ memory errors.)

3. TRAP ALL ERRORS! Hook into all OS hooks possible. If rolling your own OS, provide the best hooks possible.

4. Identify 'safe' recovery points: these are points that can be 'jumped' into for recovery. So if you get a memory fault in some routine X, know that restarting at routine Z will most likely provide the best recovery, with the least losses. (If Z fails, then know what the next best recovery point is, with a total restart being the point of last resort.)

5. Activity log!!! If errors or oddities or less likely things happen, create a log so that the problem can be re-assessed later and so the operator can know that they need to question their data. (As a secondary issue, design debug code so it STAYS in for production operation. Basic NRC and other critical-system rule: "test *the* code that will run." Don't test, then recompile with new switches and ship it!!! Very bad!)
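The recovery-point idea in point 4 can be sketched in C with setjmp/longjmp. This is a minimal illustration, not anyone's production code: the routine names (routine_X, run_with_recovery) and the tiny log buffer are hypothetical, and a real system would also re-initialize hardware and state before resuming at the safe point.

```c
#include <setjmp.h>
#include <string.h>

static jmp_buf recover_at_Z;          /* the pre-identified safe restart point */
static char last_log[64];             /* point 5: activity log, reduced to a buffer */

static void routine_X(void) {
    /* pretend an OS hook trapped a memory fault here */
    strcpy(last_log, "fault in X");   /* log it before unwinding */
    longjmp(recover_at_Z, 1);         /* jump back to the safe point Z */
}

/* Returns 0 if the run recovered cleanly at Z, 1 if it ran to the end. */
static int run_with_recovery(void) {
    if (setjmp(recover_at_Z) != 0) {
        return 0;                     /* recovery path: resume from Z */
    }
    routine_X();                      /* normal path; faults land above */
    return 1;                         /* not reached in this sketch */
}
```

A real cascade (Z fails, fall back to the next point, then total restart) would just be more jmp_bufs tried in order.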
Basically no silver bullets; just do a good job of design, test, and development. (Design and develop to your tests; test design should come *before* code design.)
A bunch of years back I saw a university student paper about "self-healing" systems, where an OS would trap things like memory faults then use a whole lot of rule-based Artificial Intelligence to track back through the application code to try and figure out what it 'meant' to do. It was an interesting read, but considering how much more I have seen about it (ZERO) in the years since, I guess it never got worked out quite right 8-)
At 08:48 AM 2003_10_29, you wrote:
> Recovery is only as good as the language framework allows it to be.
> Compilers insulate you from the data and the hardware, and reduce your
> level of control over how errors are handled.
> But that's part of what you're buying by using a compiler in the first
> place: Not to have to worry about all those "little details".
I don't disagree that having the compiler worry about some things is OK; it's more the assumption that "oh, this is Java so I don't have to do return code checking or worry about that 10G malloc() call on a small machine" that gets me miffed. No matter what language you use, there's still some basic sanity checking that has to be done to ensure stability, and it's getting rarer and rarer.
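That malloc() gripe is easy to make concrete. A sketch of the kind of basic sanity check meant here, in C (the wrapper name checked_alloc is mine, not from the thread):

```c
#include <stdlib.h>

/* Check the allocation before using it -- the sanity checking that no
   language or compiler removes the need for. */
static void *checked_alloc(size_t n) {
    void *p = malloc(n);
    if (p == NULL) {
        /* logging / recovery would go here; returning NULL keeps the
           caller honest instead of letting it crash on first use */
        return NULL;
    }
    return p;
}
```

The point isn't the wrapper; it's that the "does this 10G request even make sense on this machine" question gets asked somewhere, in any language.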
> An infinitely smart programmer could conceivably write enough code to
> fix or recover from ANY failure, but how many of THOSE are there?
Nobody's asking for perfection, but stupid little stuff like not checking arguments for sanity or indexing off the end of a string because you're too lazy to check the length before you increment is just lousy style. Eradicating code like that is a moral imperative.
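For the string case, the fix really is that small. A sketch (the helper char_at is hypothetical, just to show the shape of the check):

```c
#include <string.h>

/* Check the length before you index -- the "stupid little stuff".
   Returns the character at index i, or '\0' if i is out of range. */
static char char_at(const char *s, size_t i) {
    if (s == NULL || i >= strlen(s)) {
        return '\0';                  /* refuse to index off the end */
    }
    return s[i];
}
```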
> And who writes in assembler anymore anyway?
Still a fair amount of it for those of us doing embedded work. :caveman. Me Make Hardware Go. Ugh. :ecaveman.
-- db
Dale Strickler Cole Software, LLC Voice: 540-456-8896 Fax: 540-456-6658 Web: http://www.colesoft.com/