On Tuesday, 18 February 2014 at 23:05:21 UTC, Walter Bright wrote:
http://cacm.acm.org/magazines/2014/2/171689-mars-code/fulltext
Some interesting tidbits:
"We later revised it to require that the flight software as a
whole, and each module within it, had to reach a minimal
assertion density of 2%. There is compelling evidence that
higher assertion densities correlate with lower residual defect
densities."
This has been my experience with asserts, too.
"A failing assertion is now tied in with the fault-protection
system and by default places the spacecraft into a predefined
safe state where the cause of the failure can be diagnosed
carefully before normal operation is resumed."
Nice to see confirmation of that.
"Running the same landing software on two CPUs in parallel
offers little protection against software defects. Two
different versions of the entry-descent-and-landing code were
therefore developed, with the version running on the backup CPU
a simplified version of the primary version running on the main
CPU. In the case where the main CPU would have unexpectedly
failed during the landing sequence, the backup CPU was
programmed to take control and continue the sequence following
the simplified procedure."
An example of using dual systems for reliability.
I haven't read the link (TL;DR), but how are they detecting that a
CPU has failed? Some information must be passed outside of the CPU
to do this. The only solution that comes to my mind is that the
main CPU changes a variable in external memory at every step, and
the backup CPU checks it continuously to catch a failure
immediately. But this would already require about 50% of the CPU's
power.
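
The scheme described above (a variable in external memory that the main CPU changes and the backup CPU watches) can be sketched as a small simulation. This is only an illustration under assumptions, not how the actual flight software works: `SharedMemory`, `MainCpu`, `BackupCpu`, `simulate` and the `max_missed` threshold are all hypothetical names, and the backup checks once per tick instead of continuously.

```python
from dataclasses import dataclass


@dataclass
class SharedMemory:
    """Stands in for external memory visible to both CPUs (hypothetical)."""
    heartbeat: int = 0


class MainCpu:
    def __init__(self, memory, fails_at_step=None):
        self.memory = memory
        self.fails_at_step = fails_at_step  # step count at which the simulated failure occurs
        self.step_count = 0

    def step(self):
        """Run one step; once 'failed', the heartbeat is no longer changed."""
        if self.fails_at_step is not None and self.step_count >= self.fails_at_step:
            return  # simulated failure: nothing is written any more
        self.step_count += 1
        self.memory.heartbeat += 1


class BackupCpu:
    def __init__(self, memory, max_missed=3):
        self.memory = memory
        self.max_missed = max_missed  # assumed tolerance before taking over
        self.last_seen = memory.heartbeat
        self.missed = 0
        self.in_control = False

    def check(self):
        """Compare the heartbeat with the last value seen; take over if it is stale."""
        if self.memory.heartbeat != self.last_seen:
            self.last_seen = self.memory.heartbeat
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.in_control = True
        return self.in_control


def simulate(total_ticks, fails_at_step=None, max_missed=3):
    """Return the tick at which the backup takes control, or None if it never does."""
    memory = SharedMemory()
    main = MainCpu(memory, fails_at_step)
    backup = BackupCpu(memory, max_missed)
    for tick in range(total_ticks):
        main.step()
        if backup.check():
            return tick
    return None
```

For example, `simulate(10, fails_at_step=4, max_missed=3)` has the main CPU stop after four successful steps, and the backup takes over three stale checks later. In this sketch the cost to the backup is one comparison per tick, so how much CPU power the check needs depends on how often it is done and how quickly a failure must be caught.
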
I have been thinking about this kind of backup system myself, so
it is really great to read that some people are actually building
them.