On Saturday, 18 January 2014 at 01:46:55 UTC, Walter Bright wrote:
On 1/17/2014 4:44 PM, "Ola Fosheim Grøstad" <[email protected]>" wrote:
Big systems have to live with bugs; it is inevitable that they run with bugs.

It's a dark and stormy night. You're in a 747 on final approach, flying on autopilot.

Scenario 1
----------

The autopilot software was designed by someone who thought it should keep operating even if it detects faults in the software. The software runs into a null object when there shouldn't be one, and starts feeding bad values to the controls. The plane flips over and crashes, everybody dies. But hey, the software kept on truckin'!

Scenario 2
----------

The autopilot software was designed by Boeing. Actually, there are two autopilots, each independently developed, with different CPUs, different hardware, different algorithms, different languages, etc. One has a null pointer fault. A deadman circuit sees this, and automatically shuts that autopilot down. The other autopilot immediately takes over. The pilot is informed that one of the autopilots failed, and the pilot immediately shuts off the remaining autopilot and lands manually. The passengers all get to go home.


Note that in both scenarios there are bugs in the software. Yes, there have been incidents with earlier autopilots where bugs in them caused the airplane to go inverted.

Consider also the Toyota. My understanding from reading reports (admittedly journalists botch up the facts) is that a single computer controls the brakes, engine, throttle, ignition switch, etc. Oh joy. I wouldn't want to be in that car when it keeps on going despite having self-detected faults. It could, you know, start going at full throttle and ignore all signals to brake or turn off, only stopping when it crashes or runs out of gas.

You are running a huge website. Let's say, for instance, a social
network with more than a billion users.

Scenario 1
----------

The software was designed by someone who thought it should keep
operating even if it detects faults in the software. A bug arises
in some frontend and it starts corrupting data. Some monitoring
detects the issue, the code gets fixed, and the corrupted data is
recovered from backup. Users who ended up on that cluster saw
their accounts not working for a day, but everything is back to
normal the day after.

Scenario 2
----------

The software was designed by an ex-employee from Boeing. He knows
that he should make his software crash hard and soon. As soon as
an error is detected on a cluster, the cluster goes down.
Hopefully, no data is corrupted, but the load on that cluster
must now be handled by the other clusters. Soon enough, these
clusters overload and the whole website goes down. Hopefully, no
data was corrupted in the process, so nothing needs to be restored
from backup.



Different software, different needs. Ultimately, that distinction
is irrelevant anyway. In the case of null dereferences, the whole
possibility of these scenarios can be avoided by proper language
design.
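
To make "proper language design" a bit more concrete, here is a minimal sketch of one common approach: a type system without null references, where a possibly absent value has to be wrapped in an option type. It is written in Rust only because its standard `Option` type shows the idea compactly. The `Account` type and the `find_account` and `display_name` functions are hypothetical names made up for this illustration; nothing here comes from the thread or from any real system.

```rust
// Hypothetical record type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Account {
    id: u32,
    name: String,
}

// A lookup that may fail returns Option<&Account> rather than a
// reference that might be null.
fn find_account(accounts: &[Account], id: u32) -> Option<&Account> {
    accounts.iter().find(|a| a.id == id)
}

// The caller cannot reach `name` without first saying what happens
// in the `None` case; leaving that arm out is a compile-time error.
fn display_name(accounts: &[Account], id: u32) -> String {
    match find_account(accounts, id) {
        Some(account) => account.name.clone(),
        None => String::from("<unknown>"),
    }
}

fn main() {
    let accounts = vec![Account { id: 1, name: String::from("alice") }];
    println!("{}", display_name(&accounts, 1));
    println!("{}", display_name(&accounts, 2));
}
```

With a design like this, the question of whether to crash or to keep running on a null dereference does not come up for this class of bug, since the missing-value case has to be handled before the program compiles. Other kinds of faults are of course not covered by it.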
