[it should be noted that a lot of this is second-hand; I haven't implemented many fault-tolerant programs.]
Computers crash. This has been true as long as there have been computers, and there doesn't seem to be any end to computers crashing in sight. Computers crash for many reasons. The most common computer crashes in the past have been due to faulty operating-system software and unreliable hardware; today, it is probably more common for computers to crash because they lose power, because of a mains power outage or a battery running out of power. Sometimes computers malfunction, too, but there are well-known ways to turn most malfunctions into crashes, which it turns out is often a good idea. When a computer crashes, the programs on it almost never have a chance to respond to the imminent crash; they just stop running. This means that if not having a chance to respond before exiting breaks them, they will break when the computer crashes. In modern computing environments, there are other things that can cause programs to terminate unexpectedly. Most operating systems let users terminate their programs without warning to the programs. Many programming languages provide an exception facility that terminates subroutines without warning; although they usually provide a way to run cleanup code during the propagation of the exception (finally in Java and Python, unwind-protect in Common Lisp, dynamic-wind in Scheme, local variable destructors in C++), this facility tends to have problems of its own --- if cleanup code run from it raises an exception, one exception or the other, or both, will be lost, and the rest of the cleanup code at that level will fail to run. Finally, some operating systems (Linux in particular, but others too) overcommit memory, which means that they will occasionally need to kill programs unexpectedly. While there are no known ways to prevent all computer crashes, there are known ways to keep programs from breaking when computers crash. First, write functional code --- code that changes no state, simply computes answers from input data. Breakage in the presence of unexpected termination consists of state being left inconsistent. If your code isn't changing any state, it can't leave that state inconsistent. Your program as a whole can have state; it's just that the parts that have state are the parts that you have to use more sophisticated techniques to keep from breaking. Second, make state changes in a restartable fashion, and journal them. A restartable change is one that, if interrupted, can be done again from the beginning, and have the same effect as if it had been done from the beginning to start with. Restartability implies idempotence --- the ability to do the same change more than once and have it have the same effect as doing it once. Writing a value into a location is restartable; so is writing a set of values into a set of locations. Incrementing a set of values is not idempotent, and therefore not restartable, because doing it twice has a different result than doing it once. "Journaling" means that you write down the changes you're about to make before you start making them. When you're recovering from a possible unexpected termination, you check to see if there are any journal records that might need to be restarted, and restart them. The larger the granularity of the state changes you journal, the better restartability works and the less hassle it is. The larger the granularity, of course, the more time it takes to rerun journaled changes. All of this assumes that you're keeping your state and your journal somewhere that unexpected termination won't screw it up. What this is depends on what kind of unexpected termination you're preparing for; if it's a machine crash, then another machine or a disk is probably a good choice, but if it's the destruction of the machine, a disk is not good enough, and if it's just an unhandled exception, data structures in main memory are good enough, as long as the exception gets caught before it destroys them too. Third, limit the size of your essential state. If you have a "dependent" piece of state whose correct value can be computed from some other "essential" piece of state you have, then you are free to discard the dependent state on unexpected termination and recompute it from the essential state, either during recovery or when you need it. This simplifies your life. Fourth, make all of your state changes atomically when you finish computing what they are. Running your code under the control of a transaction processing monitor normally makes this happen automatically. This gives you the advantages of restartability without adding to the complexity of your code, but it costs more in performance, too. Fifth, use checkpoints. If it's OK to lose your most recent changes, you can periodically copy all of your program's state to somewhere else, sometime when it's consistent. Then you can restart the computation from the most recent checkpoint to recover from unexpected termination. When using checkpoints, it's often helpful to keep more than one of them saved, and to run a consistency checker on the current state before copying it over. If the consistency checker finds inconsistencies, a simple way to handle the situation is to cause an unexpected termination. Sixth, automatically retry things that fail. Several of these techniques assume that, sooner or later, someone will come along and retry the transaction, restart from the checkpoint, or restart the restartable operation. This is a general technique that often works to cope with failures in underlying systems. Retry strategies depend on the underlying failure you're trying to cope with. Butler Lampson wrote a paper in 1983 entitled "Hints for Computer System Design", in which he documented several of these techniques. At present, he has it online at http://research.microsoft.com/~lampson/33-Hints/WebPage.html

