Christopher Smith wrote:
> If you are talking about programming errors, then recovery is the wrong
> kind of behavior. At most you want the code to checkpoint and then core
> dump, preferably in some suitably annoying way that causes someone to
> notice.
Errr, no, not really. If someone puts a CGI script on my shared web
server that has a bug in it, I don't really want to coredump the web
server and take down my whole business, just so I notice.
Instead, I recover gracefully and send myself a page, which is
annoying enough to notice, thankyouverymuch.
> When you have hundreds of nodes to manage, you don't want
> programmer errors floating around the ether and code trying to recover
> from it.
"Recover from it" involves logging the error, cleaning up resources, and
restarting. Or, if it's user-submitted code, setting a flag saying not
to try to run that any more, telling the user, etc.
Why would I want to leave files hanging open, sockets connected, memory
allocated, and database transactions inconsistent? Oh, wait, that
doesn't happen because people wrote this OS that tries (unsuccessfully)
to clean up after you when you fail.
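To make "log the error, clean up resources, flag the code as not to be run again, and tell someone" concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the Supervisor class, the notify callback (standing in for sending a page), and the idea that each handler is handed a resource with a close() method. It is not how any particular web server does it.

```python
import logging

log = logging.getLogger("supervisor")


class Supervisor:
    """Runs handlers; on failure it logs, cleans up, and disables the handler."""

    def __init__(self, notify):
        self.notify = notify      # hypothetical callback, e.g. "send me a page"
        self.disabled = set()     # names of handlers flagged "do not run again"

    def run(self, name, handler, resource):
        if name in self.disabled:
            resource.close()      # flagged earlier: refuse to run it, still clean up
            return None
        try:
            return handler(resource)
        except Exception as exc:  # programming errors (ZeroDivisionError etc.) land here
            log.error("handler %s failed: %r", name, exc)
            self.disabled.add(name)
            self.notify(f"{name} disabled after {type(exc).__name__}")
            return None
        finally:
            resource.close()      # cleanup happens whether or not the handler failed
```

The buggy handler takes itself out of service and somebody gets told, but the process hosting it keeps serving everyone else.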
> This violates the "fail fast" principle. After all, if some
> programmer in some unknown place did write code that divides by zero,
> how is your exception handler to know how to fix it?
It doesn't violate the "fail fast" principle any more than the OS
closing open files and freeing your memory violates the "fail fast"
principle. You're just saying "rely on the OS to be properly coded to do
this for you" instead of "rely on your interpreter/compiler to be
properly coded to do this for you."
> I'm sorry, if you have hundreds of nodes, you have to *expect* hardware
> failures on a regular basis.
Right. But I don't need to do it at the same level of abstraction. If
the whole rack catches on fire, dumping core or logging errors becomes
irrelevant, and the next level of monitoring kicks in.
> Okay, I'm now thoroughly confused by what you mean by an unexpected
> error then. If it doesn't occur when the hardware does something
> unexpected,
Who said that? How many application programs have you written that check
that, when they add two numbers, they come up with the right answer? Or
that after you store something in an integer variable, it doesn't change
before you read it next time?
Sure, if you're launching an interplanetary satellite, you check these
things. Otherwise, the probability of that going south is too low to
worry about. Far, far lower than the probability that you've made a
coding error, or that what you coded isn't what you wanted, or that the
program will live out its entire lifecycle without ever running across a
problem that happens once every 2^128 instructions.
I mean, people still make fun of Intel for having had a bug in their
math processor.
> the software does something unexpected,
Well, yes, that too. But again, how many application programs have you
written that check that the compiler output the right code? Sure, if
you're launching an interplanetary satellite, you look at the compiled
code and check it against the source and make sure there aren't any
problems.
How do you handle it when a kernel bug means the program that just
dumped core never gets relaunched? What do you do when the file system
starts writing directory blocks over top of your inode tables? What do
you do when you start getting piles of processes that even a kill -9
won't get rid of?
There are some errors you don't bother trying to recover from
automatically. (Although I actually did write a script to recover from
that last one.)
> or the programmer does something unintended...
Right. Like there's never been a time when people wrote faulty code that
a range-checking language (as an example) would have caught at runtime.
> Either that or the programmer
> is incompetent and is just as likely to screw up the error handling.
You have to handle the error *some* way. I'm not sure why handling,
inside your own code, the errors that are known not to screw up your
language semantics isn't just as good as handling them in some other
piece of code.
And no, while it's difficult to write good error handling, once you have
the error handling in place, chances are it's covering a lot of code.
Once you have transactional rollback in your database, it too covers all
kinds of errors, including your application dumping core.
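As a small illustration of one piece of error handling covering a lot of code, here is a sketch using Python's standard sqlite3 module. The accounts table, the transfer function, and its deliberately planted divide-by-zero bug are all made up for the example. The "with conn:" block commits if the body finishes and rolls back if any exception escapes, whatever the bug turns out to be.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()


def transfer(conn, src, dst, amount, chunks):
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        per_chunk = amount // chunks  # planted bug: fails when chunks == 0
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (per_chunk * chunks, dst),
        )


try:
    transfer(conn, "a", "b", 50, 0)   # hits the bug halfway through
except ZeroDivisionError:
    pass

# The half-finished debit was rolled back, so both rows are unchanged.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The rollback knows nothing about division by zero. The same idea applies when the application dies outright instead of raising: a transaction that was never committed is discarded by the database.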
--
Darren New / San Diego, CA, USA (PST)
His kernel fu is strong.
He studied at the Shao Linux Temple.