[it should be noted that a lot of this is second-hand; I haven't
implemented many fault-tolerant programs.]

Computers crash.  This has been true as long as there have been
computers, and there doesn't seem to be any end to computers crashing
in sight.

Computers crash for many reasons.  The most common computer crashes in
the past have been due to faulty operating-system software and
unreliable hardware; today, it is probably more common for computers
to crash because they lose power, because of a mains power outage or a
battery running out of power.

Sometimes computers malfunction, too, but there are well-known ways to
turn most malfunctions into crashes, which it turns out is often a
good idea.

When a computer crashes, the programs on it almost never have a chance
to respond to the imminent crash; they just stop running.  This means
that if not having a chance to respond before exiting breaks them,
they will break when the computer crashes.

In modern computing environments, there are other things that can
cause programs to terminate unexpectedly.  Most operating systems let
users terminate their programs without warning to the programs.  Many
programming languages provide an exception facility that terminates
subroutines without warning; although they usually provide a way to
run cleanup code during the propagation of the exception (finally in
Java and Python, unwind-protect in Common Lisp, dynamic-wind in
Scheme, local variable destructors in C++), this facility tends to
have problems of its own --- if cleanup code run from it raises an
exception, one exception or the other, or both, will be lost, and the
rest of the cleanup code at that level will fail to run.  Finally,
some operating systems (Linux in particular, but others too)
overcommit memory, which means that they will occasionally need to
kill programs unexpectedly.

While there are no known ways to prevent all computer crashes, there
are known ways to keep programs from breaking when computers crash.

First, write functional code --- code that changes no state, simply
computes answers from input data.  Breakage in the presence of
unexpected termination consists of state being left inconsistent.  If
your code isn't changing any state, it can't leave that state
inconsistent.  Your program as a whole can have state; it's just that
the parts that have state are the parts that you have to use more
sophisticated techniques to keep from breaking.

Second, make state changes in a restartable fashion, and journal them.
A restartable change is one that, if interrupted, can be done again
from the beginning, and have the same effect as if it had been done
from the beginning to start with.  Restartability implies idempotence
--- the ability to do the same change more than once and have it have
the same effect as doing it once.  Writing a value into a location is
restartable; so is writing a set of values into a set of locations.
Incrementing a set of values is not idempotent, and therefore not
restartable, because doing it twice has a different result than doing
it once.

"Journaling" means that you write down the changes you're about to
make before you start making them.  When you're recovering from a
possible unexpected termination, you check to see if there are any
journal records that might need to be restarted, and restart them.

The larger the granularity of the state changes you journal, the
better restartability works and the less hassle it is.  The larger the
granularity, of course, the more time it takes to rerun journaled
changes.

All of this assumes that you're keeping your state and your journal
somewhere that unexpected termination won't screw it up.  What this is
depends on what kind of unexpected termination you're preparing for;
if it's a machine crash, then another machine or a disk is probably a
good choice, but if it's the destruction of the machine, a disk is not
good enough, and if it's just an unhandled exception, data structures
in main memory are good enough, as long as the exception gets caught
before it destroys them too.

Third, limit the size of your essential state.  If you have a
"dependent" piece of state whose correct value can be computed from
some other "essential" piece of state you have, then you are free to
discard the dependent state on unexpected termination and recompute it
from the essential state, either during recovery or when you need it.
This simplifies your life.

Fourth, make all of your state changes atomically when you finish
computing what they are.  Running your code under the control of a
transaction processing monitor normally makes this happen
automatically.  This gives you the advantages of restartability
without adding to the complexity of your code, but it costs more in
performance, too.

Fifth, use checkpoints.  If it's OK to lose your most recent changes,
you can periodically copy all of your program's state to somewhere
else, sometime when it's consistent.  Then you can restart the
computation from the most recent checkpoint to recover from unexpected
termination.

When using checkpoints, it's often helpful to keep more than one of
them saved, and to run a consistency checker on the current state
before copying it over.  If the consistency checker finds
inconsistencies, a simple way to handle the situation is to cause an
unexpected termination.

Sixth, automatically retry things that fail.  Several of these
techniques assume that, sooner or later, someone will come along and
retry the transaction, restart from the checkpoint, or restart the
restartable operation.  This is a general technique that often works
to cope with failures in underlying systems.

Retry strategies depend on the underlying failure you're trying to
cope with.

Butler Lampson wrote a paper in 1983 entitled "Hints for Computer
System Design", in which he documented several of these techniques.
At present, he has it online at
http://research.microsoft.com/~lampson/33-Hints/WebPage.html

Reply via email to