Darren New wrote:
> Christopher Smith wrote:
>> If you are talking about programming errors, then recovery is the
>> wrong kind of behavior. At most you want the code to checkpoint and
>> then core dump, preferably in some suitably annoying way that causes
>> someone to notice.
> Errr, no, not really. If someone puts a CGI script on my shared web
> server that has a bug in it, I don't really want to coredump the web
> server and take down my whole business, just so I notice.
No, but you'd probably want that CGI script's process to die and core
dump, while the web server goes about its business. Maybe you worry
about running out of disk space for cores, but fortunately you've set
reasonable sysctl parameters so that you don't worry so much about that.
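The per-process side of those knobs can be sketched with Python's standard `resource` module (Unix-only); the 64 MiB cap here is an illustrative value, not a recommendation:

```python
import resource

# Read the current core-file size limits for this process (in bytes).
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

# Cap core files at 64 MiB (or the hard limit, if that's lower) so a
# crashing CGI child can still dump core without filling the disk.
cap = 64 * 1024 * 1024
if hard != resource.RLIM_INFINITY:
    cap = min(cap, hard)
resource.setrlimit(resource.RLIMIT_CORE, (cap, hard))

print(resource.getrlimit(resource.RLIMIT_CORE)[0])  # the new soft limit
```

Children inherit the limit across fork/exec, which is why setting it once in the server process covers every CGI child it spawns.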
> Instead, I recover gracefully and send myself a page, which is
> suitably annoying enough to notice, thankyouverymuch.
This is exactly what I described as happening with our systems. Why does
everyone think dumping core is the end of the world?
>> When you have hundreds of nodes to manage, you don't want
>> programmer errors floating around the ether and code trying to
>> recover from it.
> "Recover from it" involves logging the error, cleaning up resources,
> and restarting. Or, if it's user-submitted code, setting a flag saying
> not to try to run that any more, telling the user, etc.
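A minimal sketch of that policy, with hypothetical job names and a made-up failure threshold; real supervisors do the same bookkeeping with more care:

```python
import logging
import traceback

logging.basicConfig(level=logging.ERROR)
MAX_FAILURES = 3  # illustrative threshold

def supervise(jobs, rounds=10):
    """Run each job every round; on error, log it and try again next
    round; after MAX_FAILURES, flag the job and stop running it."""
    failures = {name: 0 for name in jobs}
    for _ in range(rounds):
        for name, job in jobs.items():
            if failures[name] >= MAX_FAILURES:
                continue  # flagged: don't try to run that any more
            try:
                job()
            except Exception:
                failures[name] += 1
                logging.error("job %s failed:\n%s", name,
                              traceback.format_exc())
    return failures

# Hypothetical user-submitted jobs: one healthy, one buggy.
jobs = {"ok": lambda: None, "buggy": lambda: 1 / 0}
counts = supervise(jobs)
print(counts)  # the buggy job is retried only up to MAX_FAILURES
```

The cleanup step is elided here; in a real supervisor each retry would tear down and re-acquire the job's resources before running it again.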
> Why would I want to leave files hanging open, sockets connected,
> memory allocated, and database transactions inconsistent? Oh, wait,
> that doesn't happen because people wrote this OS that tries
> (unsuccessfully) to clean up after you when you fail.
Okay, if the OS can't clean up file handles, sockets, and memory when
your process dies, I very much doubt you're going to have much luck
doing so in some catch block. The same goes for your database and its
transactions. Seriously, process death is about as safe a way as you can
find to clean up from an undefined error.
>> This violates the "fail fast" principle. After all, if some
>> programmer in some unknown place did write code that divides by zero,
>> how is your exception handler to know how to fix it?
> It doesn't violate the "fail fast" principle any more than the OS
> closing open files and freeing your memory violates the "fail fast"
> principle. You're just saying "rely on the OS to be properly coded to
> do this for you" instead of "rely on your interpreter/compiler to be
> properly coded to do this for you."
Yes, because your interpreter/compiler inherently has to trust all
kinds of other code beyond the OS to be implemented correctly. Most of
that code doesn't have a good way to work from the "I don't trust any of
this stuff" perspective, nor has it been tested in that regard nearly as
much.
>> I'm sorry, if you have hundreds of nodes, you have to *expect*
>> hardware failures on a regular basis.
> Right. But I don't need to do it at the same level of abstraction. If
> the whole rack catches on fire, dumping core or logging errors becomes
> irrelevant, and the next level of monitoring kicks in.
Sure, but there are far more subtle errors that are pretty much only
going to show up in the form of your code finding itself in an
unexpected state.
> There are some errors you don't bother trying to recover from
> automatically. (Altho that last one I actually wrote a script to
> recover from.)
I couldn't agree more. Unexpected errors are very dangerous to try to
recover from, particularly because by their very nature you can't be
sure whether you really have a recoverable problem or not. Even exiting
a process is no guarantee, but it is probably the best you can do.
>> or the programmer does something unintended...
> Right. Like there's never been a time when people wrote faulty code
> that a range-checking language (as an example) would have caught at
> runtime.
Why do you need a range-checking language? How about a range-checking
library? Is there some magic that comes from having a range-checking
language that calls down to code in a non-range-checking language, vs.
using a range-checked iterator that calls down to code written in a
non-range-checked language?
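That point can be sketched in Python with a hypothetical unchecked buffer (simulating the silent wraparound you can get from unchecked reads) wrapped by a checking accessor; the range checking lives entirely in the library layer, not in the language underneath:

```python
class UncheckedBuffer:
    """Simulates an unchecked, C-style buffer: out-of-range reads
    silently return the wrong data instead of failing."""
    def __init__(self, data):
        self._data = list(data)

    def raw_read(self, i):
        return self._data[i % len(self._data)]  # silent wraparound

class CheckedView:
    """Range-checked accessor layered on top: same storage underneath,
    but a bad index fails loudly instead of returning garbage."""
    def __init__(self, buf, length):
        self._buf, self._len = buf, length

    def __getitem__(self, i):
        if not 0 <= i < self._len:
            raise IndexError(f"index {i} out of range [0, {self._len})")
        return self._buf.raw_read(i)

buf = UncheckedBuffer([10, 20, 30])
view = CheckedView(buf, 3)
print(view[2])            # in range: fine
try:
    view[3]               # off-by-one: caught by the library
except IndexError as e:
    print("caught:", e)
```

The names here are invented for illustration; the same layering is what checked iterators in C++ or bounds-checked wrappers over FFI buffers do in practice.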
>> Either that or the programmer is incompetent and is just as likely to
>> screw up the error handling.
> You have to handle the error *some* way. I'm not sure why handling
> inside your code the errors that are known not to screw up your
> language semantics isn't just as good as handling it in some other
> piece of code.
Because I've been burned by people doing exactly this kind of well
intentioned coding far too often. Their stack has been corrupted, and
they don't realize it, so instead of simply failing, they try to
"recover" from the problem, only their stack is corrupted, so their
"recovery" that is supposed to just clean things up ends up setting
someone's account balance to zero, or causes the system that provably
can't deadlock to deadlock, etc.
> And no, while it's difficult to write good error handling, once you
> have the error handling in place, chances are it's covering a lot of
> code.
I didn't say that good error handling can't be done, merely that if your
expectation is that you are recovering from a logical error, I'd sure
like the recovery code to be the product of some other development process.
> Once you have transactional rollback in your database, it too covers
> all kinds of errors, including your application dumping core.
Yes, because the database has "client connection died" as one of its
*expected* errors.
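What the database side of that looks like can be sketched with sqlite3; the exception here stands in for the client dying mid-transaction, which a real server handles the same way when it notices the connection drop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back if the block raises
        conn.execute("UPDATE accounts SET balance = 0 WHERE id = 1")
        raise RuntimeError("client died mid-transaction")
except RuntimeError:
    pass  # the application is gone; the database cleans up

balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # the half-done update was rolled back
```

The application never ran any recovery code of its own; the rollback is the database treating "the client went away" as an expected error, which is exactly the point above.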
--Chris
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg