On Monday, 29 September 2014 at 02:57:03 UTC, Walter Bright wrote:

> I've said that processes are different, because the scope of the effects is limited by the hardware.
>
> If a system with threads that share memory cannot be restarted, there are serious problems with the design of it, because a crash and the necessary restart are going to happen sooner or later, probably sooner.

Right. But if the condition that caused the restart persists, the process can end up in a cascading restart scenario. Simply restarting on error isn't necessarily enough.
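
To make that concrete, here's a minimal sketch in D (every name and threshold below is made up for illustration) of a supervisor loop that detects exactly that scenario: if the worker keeps dying within a short window, the restarts are cascading and the loop gives up instead of thrashing forever.

import core.thread : Thread;
import core.time : msecs, seconds;
import std.datetime : Clock, SysTime;
import std.stdio : writeln;

// Hypothetical worker; a persistent fault makes it fail on every run.
void runWorker() { throw new Exception("persistent fault"); }

void main()
{
    int restarts = 0;
    SysTime windowStart = Clock.currTime;

    while (true)
    {
        try
        {
            runWorker();
            restarts = 0;  // a clean run resets the failure count
        }
        catch (Exception e)
        {
            if (Clock.currTime - windowStart > 10.seconds)
            {
                // Failures are spread out; start a fresh accounting window.
                windowStart = Clock.currTime;
                restarts = 0;
            }
            if (++restarts > 5)
            {
                // Five deaths inside ten seconds: the condition persists,
                // so restarting again would just cascade.
                writeln("giving up: ", e.msg);
                break;
            }
            Thread.sleep(100.msecs * restarts);  // back off before restarting
        }
    }
}

Erlang supervisors do essentially this with their maximum restart intensity: exceed MaxR restarts within MaxT seconds and the supervisor itself gives up and escalates.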


> I don't believe that the way to get 6 sigma reliability is by ignoring errors and hoping. Airplane software is most certainly not done that way.

I believe I was arguing the opposite. More to the point, I think it's necessary to expect undefined behavior to occur and to plan for it. I think we're on the same page here and just miscommunicating.


> I recall Toyota got into trouble with their computer-controlled cars because of their approach to handling inevitable bugs and errors. It was one process that controlled everything. When something unexpected went wrong, it kept right on operating, any unknown and unintended consequences be damned.
>
> The way to get reliable systems is to design to accommodate errors, not pretend they didn't happen, or hope that nothing else got affected, etc. In critical software systems, that means shut down and restart the offending system, or engage the backup.

My point was that it's often more complicated than that. There have been papers written on self-repairing systems, for example, and on ways to design systems that are inherently durable even in the face of internal errors. What I'm trying to say is that simply aborting on error is too brittle in some cases, because it only addresses one failure vector: memory corruption that is unlikely to recur. I've watched always-on systems fall apart from some unexpected ongoing condition, and simply restarting doesn't actually help.
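
For illustration, here's a crude sketch in D of one alternative (all types and names are hypothetical, not from any real system): a circuit breaker that stops hammering a component that keeps failing and fails over to a degraded backup, rather than restarting the primary forever.

import std.stdio : writeln;

// Hypothetical service interface; names are illustrative only.
interface Service { string handle(string request); }

class PrimaryService : Service
{
    // Stands in for a component with an ongoing, non-transient fault.
    string handle(string request) { throw new Exception("ongoing fault"); }
}

class DegradedService : Service
{
    // Reduced functionality that doesn't depend on the failing component.
    string handle(string request) { return "degraded: cached answer"; }
}

// A minimal circuit breaker: after `threshold` consecutive failures,
// stop retrying the primary and route requests to the backup.
class Breaker
{
    private Service primary, backup;
    private int failures;
    private enum threshold = 3;

    this(Service primary, Service backup)
    {
        this.primary = primary;
        this.backup = backup;
    }

    string handle(string request)
    {
        if (failures >= threshold)
            return backup.handle(request);  // fault persists; stay failed over
        try
        {
            auto r = primary.handle(request);
            failures = 0;  // primary recovered
            return r;
        }
        catch (Exception e)
        {
            ++failures;
            return backup.handle(request);
        }
    }
}

void main()
{
    auto b = new Breaker(new PrimaryService, new DegradedService);
    foreach (i; 0 .. 5)
        writeln(b.handle("req"));
}

The point isn't this particular pattern; it's that the recovery policy has to match the failure, and "abort and restart" is just one policy among several.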
