On Fri, Oct 31, 2014 at 08:23:04PM +0000, Kagamin via Digitalmars-d wrote:
> On Friday, 24 October 2014 at 18:47:59 UTC, H. S. Teoh via Digitalmars-d
> wrote:
> >Basically, if you want a component to recover from a serious problem
> >like a failed assertion, the recovery code should be in a *separate*
> >component. Otherwise, if the recovery code is within the failing
> >component, you have no way to know if the recovery code itself has
> >been compromised, and trusting that it will do the right thing is
> >very dangerous (and is what often leads to nasty security exploits).
> >The watcher must be separate from the watched, otherwise how can you
> >trust the watcher?
>
> You make process isolation sound like a silver bullet, but failure can
> happen on any scale from a temporary variable to global network. You
> can't use process isolation to contain a failure of a larger than
> process scale, and it's an overkill for a failure of a temporary
> variable scale.
You're missing the point. The point is that a reliable system made of
unreliable parts can only be reliable if you have multiple *redundant*
copies of each component that are *decoupled* from each other. The usual
unit of isolation at the lowest level is that of a single process, because
threads within a process have full access to memory shared by all threads.
Therefore, they are not decoupled from each other, and therefore you cannot
put any confidence in the correct functioning of other threads once a
single thread has become inconsistent. The only failsafe solution is to
have multiple redundant processes, so that when one process becomes
inconsistent, you fall back to another, *decoupled* process that is known
to be good.

This does not mean that process isolation is a "silver bullet" -- I never
said any such thing. The same reasoning applies to larger components in the
system as well. If you have a server that performs function X, and the
server begins to malfunction, you cannot expect the server to fix itself --
because you don't know whether a hacker has rooted the server and is
running exploit code instead of your application. The only 100% safe way to
recover is to have another redundant server (or more) that also performs
function X: shut down the malfunctioning server for investigation and
repair, and in the meantime switch over to the redundant server to continue
operations. You don't shut down the *entire* network unless all redundant
components have failed.

The reason you cannot go below the process level as a unit of redundancy is
coupling. The above design of failing over to a redundant module only works
if the modules are completely decoupled from each other. Otherwise, you end
up with the situation where you have two redundant modules M1 and M2, but
both of them share a common helper module M3. Then if M1 detects a problem,
you cannot be 100% sure it's not caused by a problem with M3, so if you
just switch to M2, it will fail in the same way. Similarly, you cannot
guarantee that M1, while malfunctioning, has not somehow damaged M3,
thereby also making M2 unreliable. The only way to be 100% sure that
failover will actually fix the problem is to make sure that M1 and M2 are
completely isolated from each other (e.g., by having two redundant copies
of M3 that are isolated from each other).

Since a single process is the unit of isolation in the OS, you can't go
below this granularity: as I've already said, if one thread is
malfunctioning, it may have trashed the data shared by all the other
threads in the same process, and therefore none of the other threads can be
trusted to continue operating correctly. The only way to be 100% sure that
failover will actually fix the problem is to switch over to another process
that you *know* is not coupled to the old, malfunctioning process.

Attempting to have a process "fix itself" after detecting an inconsistency
is unreliable -- you're leaving it up to chance whether the attempted
recovery will actually work rather than make the problem worse. You cannot
guarantee that the recovery code itself hasn't been compromised by the
failure -- because the recovery code exists in the same process, is
vulnerable to the same problem that caused the original failure, and is
vulnerable to memory corruption caused by the malfunctioning code prior to
the point the problem was detected. Therefore, the recovery code is not
trustworthy and cannot be relied on to continue operating correctly.
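To make the watcher/watched separation concrete, here's a minimal sketch in
D of the kind of supervisor I mean: a parent process that spawns a
hypothetical "./worker" executable and, when the worker dies (say, on a
failed assertion), discards it and starts a fresh, known-good copy instead
of letting the worker try to repair itself in-process. The worker name and
restart limit are placeholders, not a prescription.

---
// Supervisor sketch: the watcher lives in a separate address space
// from the watched, so a worker that becomes inconsistent cannot
// corrupt the recovery logic.
import std.process : spawnProcess, wait;
import std.stdio : writeln;

void main()
{
    enum maxRestarts = 5;   // placeholder redundancy budget

    foreach (attempt; 0 .. maxRestarts)
    {
        auto worker = spawnProcess(["./worker"]);   // isolated process
        immutable status = wait(worker);            // block until it exits

        if (status == 0)
        {
            writeln("worker finished normally");
            return;
        }

        // The worker became inconsistent; nothing in *its* memory is
        // trusted. The supervisor's own state was never shared with it,
        // so failing over to a fresh copy is safe.
        writeln("worker failed with status ", status,
                "; starting a fresh copy (attempt ", attempt + 1, ")");
    }

    writeln("redundant copies exhausted; escalating");
}
---

The essential property isn't the restart loop itself, it's that the
supervisor never shares mutable state with the worker, so a corrupted
worker cannot take the recovery path down with it.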
That kind of "maybe, maybe not" recovery is not what I'd want to put any
trust in, especially when it comes to critical applications that can cost
lives if things go wrong.


T

-- 
English has the lovely word "defenestrate", meaning "to execute by throwing
someone out a window", or more recently "to remove Windows from a computer
and replace it with something useful". :-) -- John Cowan
