On Monday, 29 September 2014 at 02:57:03 UTC, Walter Bright wrote:
> I've said that processes are different, because the scope of
> the effects is limited by the hardware.
>
> If a system with threads that share memory cannot be restarted,
> there are serious problems with the design of it, because a
> crash and the necessary restart are going to happen sooner or
> later, probably sooner.
Right. But if the condition that caused the restart persists,
the process can end up in a cascading restart scenario. Simply
restarting on error isn't necessarily enough.
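
To make that concrete, here's a minimal supervisor sketch in D:
restart the worker process when it dies, but treat too many
crashes in a short window as a persistent condition and stop
restarting. The "./worker" binary and the thresholds are
placeholders, not anything from a real system:

```d
import core.thread : Thread;
import core.time : dur;
import std.datetime : Clock, SysTime;
import std.process : spawnProcess, wait;
import std.stdio : writeln;

// Restart a worker process on failure, but treat repeated crashes
// within a short window as a persistent condition: back off, then
// stop restarting and escalate instead of cascading forever.
void supervise(string[] cmd)
{
    SysTime[] crashes;
    enum maxRapid = 5;               // crashes tolerated...
    enum window = dur!"minutes"(1);  // ...within this window

    while (true)
    {
        auto pid = spawnProcess(cmd);
        if (wait(pid) == 0)
            return;                  // clean exit, nothing to do

        auto now = Clock.currTime;
        crashes ~= now;
        import std.algorithm : filter;
        import std.array : array;
        // Keep only the crashes inside the sliding window.
        crashes = crashes.filter!(t => now - t < window).array;

        if (crashes.length >= maxRapid)
        {
            // Restarting again would just cascade; escalate.
            writeln("crash loop detected; giving up and escalating");
            return;
        }
        // Exponential backoff before the next restart attempt.
        Thread.sleep(dur!"seconds"(1 << crashes.length));
    }
}

void main()
{
    supervise(["./worker"]);         // "./worker" is a placeholder
}
```

The interesting policy decision is what "escalating" means; that's
exactly the part a blind restart loop doesn't have.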
> I don't believe that the way to get 6 sigma reliability is by
> ignoring errors and hoping. Airplane software is most certainly
> not done that way.
I believe I was arguing the opposite. More to the point, I think
it's necessary to expect undefined behavior to occur and to plan
for it. I suspect we're on the same page here and just
miscommunicating.
> I recall Toyota got into trouble with their computer-controlled
> cars because of their idea of how to handle inevitable bugs and
> errors. It was one process that controlled everything. When
> something unexpected went wrong, it kept right on operating,
> any unknown and unintended consequences be damned.
>
> The way to get reliable systems is to design to accommodate
> errors, not pretend they didn't happen, or hope that nothing
> else got affected, etc. In critical software systems, that
> means shut down and restart the offending system, or engage the
> backup.
My point was that it's often more complicated than that. There
have been papers written on self-repairing systems, for example,
and on ways to design systems that stay durable even in the face
of internal errors. What I'm trying to say is that simply
aborting on error is too brittle in some cases, because it
addresses only one failure vector: memory corruption that is
unlikely to recur. I've watched always-on systems fall apart
under some unexpected but persistent condition, where simply
restarting doesn't actually help.
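
When the condition persists, the more useful move is often to
degrade to a backup path rather than restart the primary yet
again. A minimal sketch of that idea, with hypothetical names
(queryLiveService, cachedAnswer) standing in for a real service
and its degraded fallback:

```d
import std.stdio : writeln;

// Hypothetical stand-ins for a real primary service and a degraded
// fallback; queryLiveService models a persistent failure.
int queryLiveService() { throw new Exception("service unreachable"); }
int cachedAnswer() { return 42; }

// Try the primary a few times; if the failure persists, engage the
// backup rather than retrying (or restarting) forever.
T withFallback(T)(lazy T primary, lazy T backup, size_t retries = 3)
{
    foreach (i; 0 .. retries)
    {
        try
        {
            return primary;  // lazy parameter: evaluated on each use
        }
        catch (Exception e)
        {
            writeln("primary failed (attempt ", i + 1, "): ", e.msg);
        }
    }
    // The condition evidently persists; switch to the backup path.
    return backup;
}

void main()
{
    writeln(withFallback(queryLiveService(), cachedAnswer())); // 42
}
```

In a real system the fallback might be a standby replica or a
reduced-functionality mode; the point is just that the recovery
policy has to be richer than restart-and-hope.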