Darren New wrote:
> Note, incidentally, I'm talking about programming errors, including
> should-have-been-expected errors like using a file that's already been
> closed, opening a socket that's already open by another process, etc.
> I'm also talking about programming errors like violating what the
> language says you're allowed to do, like using unset variables,
> running off arrays, etc.

If you are talking about programming errors, then recovery is the wrong
kind of behavior. At most you want the code to checkpoint and then core
dump, preferably in some suitably annoying way that causes someone to
notice. When you have hundreds of nodes to manage, you don't want
programmer errors floating around the ether and code trying to recover
from them. Attempting recovery violates the "fail fast" principle. After all, if some
programmer in some unknown place did write code that divides by zero,
how is your exception handler to know how to fix it?
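
To make "checkpoint, then die loudly" concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the names `run_step`, `checkpoint`, and `PROGRAMMER_ERRORS`, the choice of which exception types count as programmer errors, and the JSON checkpoint format.

```python
import json
import os
import sys

# Exception types treated as programmer errors (an assumption for this sketch).
PROGRAMMER_ERRORS = (ZeroDivisionError, IndexError, NameError, AssertionError)


def checkpoint(state, path):
    """Write state to path atomically so a crash mid-write leaves no partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def run_step(step, state, checkpoint_path):
    """Run one unit of work; on a programmer error, checkpoint and fail loudly."""
    try:
        return step(state)
    except PROGRAMMER_ERRORS as exc:
        checkpoint(state, checkpoint_path)
        print(
            f"FATAL programmer error: {exc!r}; state saved to {checkpoint_path}",
            file=sys.stderr,
        )
        # No recovery attempt: re-raise so the process dies where someone notices.
        raise
```

The handler never tries to guess a "fixed" result for the division; it only preserves state and gets out of the way. If you want an actual core dump rather than a traceback, replacing the final `raise` with `os.abort()` is one option on Unix, subject to the system's core-size limits.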

> For errors that *nobody* anticipated, like the CPU not following its
> own specs, you have to do Space Shuttle engineering-level work, which
> nobody really wants to pay for. Or you have to do Google-level work,
> making everything redundant enough that having portions fail is just
> business as usual.

I'm sorry, if you have hundreds of nodes, you have to *expect* hardware
failures on a regular basis.
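
The arithmetic behind that is simple. As a rough sketch, assume each node fails independently with the same probability over some period (both the independence and the numbers below are assumptions, not measurements); then the chance that at least one node fails is 1 - (1 - p)^N:

```python
def p_any_failure(nodes, p_node):
    """Probability that at least one of `nodes` independent nodes fails,
    given each fails with probability `p_node` over the same period."""
    return 1.0 - (1.0 - p_node) ** nodes


# Hypothetical figures: a 0.1% per-node failure chance per period.
for n in (1, 100, 300, 1000):
    print(n, round(p_any_failure(n, 0.001), 3))
```

Even a per-node failure rate that looks negligible turns into a routine event once N is in the hundreds.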

> And for errors where the behavior of the program is exactly as you
> wrote rather than exactly as you intended, it's also difficult to
> compensate. Not impossible, but difficult, involving more of a
> structural component than simple programming choices.

Okay, I'm now thoroughly confused by what you mean by an unexpected
error then. If it doesn't occur when the hardware does something
unexpected, the software does something unexpected, or the programmer
does something unintended... that would seem to leave the case of the
programmer doing exactly what they intended, which I'd think would
presumably lead to something... expected. Either that or the programmer
is incompetent and is just as likely to screw up the error handling.
--Chris