Darren New wrote:
Note, incidentally, I'm talking about programming errors, including should-have-been-expected errors like using a file that's already been closed, opening a socket that's already been opened by another process, etc. I'm also talking about programming errors like violating what the language says you're allowed to do, like using unset variables, running off the end of arrays, etc.
If you are talking about programming errors, then recovery is the wrong kind of behavior. At most you want the code to checkpoint and then core dump, preferably in some suitably annoying way that causes someone to notice. When you have hundreds of nodes to manage, you don't want programmer errors floating around the ether and code trying to recover from them. That violates the "fail fast" principle. After all, if some programmer in some unknown place did write code that divides by zero, how is your exception handler to know how to fix it?
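To make the distinction concrete, here is a rough Python sketch of what I mean by failing fast. The names (run_task, flaky_io, buggy) are made up for illustration, and treating OSError as "the anticipated failure" is just an assumption for the example. The handler recovers only from the failure it was written to expect and lets a programming error such as a divide by zero propagate and kill the process.

```python
import sys


def run_task(task, *args):
    """Run one unit of work, recovering only from anticipated failures."""
    try:
        return task(*args)
    except OSError as exc:
        # Anticipated operational failure (disk, network): report and move on.
        print(f"recoverable: {exc}", file=sys.stderr)
        return None
    # Anything else (ZeroDivisionError, IndexError, ...) is a bug. It is
    # deliberately not caught, so it propagates and stops the process loudly.


def flaky_io():
    # Hypothetical stand-in for an expected failure such as a dropped connection.
    raise OSError("connection reset")


def buggy(n):
    # Programmer error when n == 0; no handler can know the intended result.
    return 1 / n
```

Calling run_task(flaky_io) logs the problem and returns None, while run_task(buggy, 0) raises ZeroDivisionError all the way up, which is the point: someone gets a traceback instead of a node quietly limping along.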
For errors that *nobody* anticipated, like the CPU not following its own specs, you have to do Space Shuttle engineering-level work, which nobody really wants to pay for. Or you have to do Google-level work, making everything redundant enough that having portions fail is just business as usual.
I'm sorry, if you have hundreds of nodes, you have to *expect* hardware failures on a regular basis.
And for errors where the program behaves exactly as you wrote it rather than exactly as you intended, it's also difficult to compensate. Not impossible, but difficult, involving more of a structural component than simple programming choices.
Okay, I'm now thoroughly confused by what you mean by an unexpected error then. If it doesn't occur when the hardware does something unexpected, the software does something unexpected, or the programmer does something unintended... that would seem to leave the case of the programmer doing exactly what they intended, which I'd think would presumably lead to something... expected. Either that or the programmer is incompetent and is just as likely to screw up the error handling.

--Chris

