On 6/11/07, Christopher Smith <[EMAIL PROTECTED]> wrote:
Darren New wrote:
> Note, incidentally, I'm talking about programming errors, including
> should-have-been-expected errors like using a file that's already been
> closed, opening a socket that's already open by another process, etc.
> I'm also talking about programming errors like violating what the
> language says you're allowed to do, like using unset variables,
> running off arrays, etc.
If you are talking about programming errors, then recovery is the wrong
kind of behavior.  At most you want the code to checkpoint and then core
dump, preferably in some suitably annoying way that causes someone to
notice. When you have hundreds of nodes to manage, you don't want
programmer errors floating around the ether and code trying to recover
from them. Recovery here violates the "fail fast" principle. After all, if some
programmer in some unknown place did write code that divides by zero,
how is your exception handler to know how to fix it?
> For errors that *nobody* anticipated, like the CPU not following its
> own specs, you have to do Space Shuttle engineering-level work, which
> nobody really wants to pay for. Or you have to do Google-level work,
> making everything redundant enough that having portions fail is just
> business as usual.
I'm sorry, if you have hundreds of nodes, you have to *expect* hardware
failures on a regular basis.
> And for errors where the program behaves exactly as you wrote it
> rather than exactly as you intended, it's also difficult to
> compensate. Not impossible, but difficult, involving more of a
> structural component than simple programming choices.
Okay, I'm now thoroughly confused by what you mean by an unexpected
error then. If it doesn't occur when the hardware does something
unexpected, the software does something unexpected, or the programmer
does something unintended... that would seem to leave the case of the
programmer doing exactly what they intended, which I'd think would
presumably lead to something... expected. Either that or the programmer
is incompetent and is just as likely to screw up the error handling.

Python-esque syntax here:

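   numExceptions = 0
   for customer in customers:
       try:
           customer.performBilling()
       except Exception as e:
           # Log the failure and move on to the next customer.
           self.logException(e)
           numExceptions += 1
           # Too many failures suggests a systemic problem: stop billing.
           if numExceptions > maxExceptions:
               return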

I did this in a real project where a nightly cron job would compute
customer balances, charge their cards, send them "your card is denied"
messages, etc. Continuing to the next customer when one failed was
much better than stopping all billing. Often the failures were
confined to very specific cases and only 1 or 2 customers would fail
while the others would proceed just fine.
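
For anyone who wants to try the pattern, here is a minimal self-contained sketch of the loop above. The Customer class, run_billing, and max_failures are made-up names for illustration, not the code from the actual project, and the standard logging module stands in for whatever logException did there.

```python
import logging

logger = logging.getLogger("billing")


class Customer:
    """Hypothetical stand-in for a real customer record."""

    def __init__(self, name, fails=False):
        self.name = name
        self.fails = fails
        self.billed = False

    def perform_billing(self):
        # Simulate a billing failure for selected customers.
        if self.fails:
            raise RuntimeError(f"billing failed for {self.name}")
        self.billed = True


def run_billing(customers, max_failures):
    """Bill each customer; stop early once failures exceed max_failures.

    Returns the list of customers whose billing raised an exception.
    """
    failed = []
    for customer in customers:
        try:
            customer.perform_billing()
        except Exception:
            # Record the traceback, then continue with the next customer.
            logger.exception("billing failed for %s", customer.name)
            failed.append(customer)
            if len(failed) > max_failures:
                # Too many failures suggests a systemic problem: give up.
                break
    return failed
```

Returning the failed customers, rather than just counting them, makes it easy to retry or report on exactly the ones that broke.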

Another example of recovery is a GUI business application where you
let the user decide whether to quit or continue at their own risk.
Sometimes they really need to continue.
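
A rough sketch of that idea, with the dialog replaced by a callback so it can run without a GUI toolkit. The names run_action, ask_user, and console_prompt are hypothetical; in a real application ask_user would pop up a "Quit / Continue" dialog.

```python
def run_action(action, ask_user):
    """Run action(); on an unexpected exception, let the user decide.

    ask_user(exc) is a hypothetical callback that returns True to
    continue or False to quit. Returns True if the action succeeded,
    False if it failed and the user chose to continue anyway.
    """
    try:
        action()
    except Exception as exc:
        if not ask_user(exc):
            # User chose to quit: let the original error propagate.
            raise
        # User chose to continue at their own risk.
        return False
    return True


def console_prompt(exc):
    """Console stand-in for a GUI dialog."""
    answer = input(f"Unexpected error: {exc!r}. Continue anyway? [y/N] ")
    return answer.strip().lower() == "y"
```

Typical use would be run_action(save_invoice, console_prompt), where save_invoice is whatever operation the user triggered.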

While it's true that theoretically you can't prove anything about the
state of your code after an unexpected exception, it's also true that
practically there are cases where recovery does much more good than
harm.

-Chuck

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg