Darren New wrote:
Christopher Smith wrote:
If you are talking about programming errors, then recovery is the wrong kind of behavior. At most you want the code to checkpoint and then core dump, preferably in some suitably annoying way that causes someone to notice.

Errr, no, not really. If someone puts a CGI script on my shared web server that has a bug in it, I don't really want to coredump the web server and take down my whole business, just so I notice.
No, but you'd probably want that CGI script's process to die and core dump, while the web server goes about its business. Maybe you worry about running out of disk space for cores, but fortunately you've set reasonable sysctl parameters so that you don't worry so much about that.
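The rlimit/sysctl knob alluded to above can be set per-process rather than system-wide. Here is a minimal sketch, assuming a POSIX system with Python's `resource` module; the 100 MB cap is an arbitrary illustrative value, and where the core file actually lands is governed separately by the `kernel.core_pattern` sysctl, which needs root.

```python
import resource

# Cap core dumps for this process (and any children it forks) at ~100 MB,
# the moral equivalent of the shell's `ulimit -c`.  (0, hard) would
# disable cores entirely.  Cap is clamped to the hard limit when one exists.
cap = 100 * 2**20
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
if hard != resource.RLIM_INFINITY:
    cap = min(cap, hard)
resource.setrlimit(resource.RLIMIT_CORE, (cap, hard))
```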
Instead, I recover gracefully and send myself a page, which is suitably annoying enough to notice, thankyouverymuch.
This is exactly what I described as happening with our systems. Why does everyone think dumping core is the end of the world?
When you have hundreds of nodes to manage, you don't want programmer errors floating around the ether and code trying to recover from them.

"Recover from it" involves logging the error, cleaning up resources, and restarting. Or, if it's user-submitted code, setting a flag saying not to try to run that any more, telling the user, etc.
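That log/clean-up/restart/flag loop can be sketched in a few lines. This is a hypothetical supervisor, not anyone's actual web server code; the names `run_supervised` and `disabled` are made up for illustration.

```python
import logging
import traceback

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("supervisor")

disabled = set()  # user-submitted jobs flagged as "don't run this again"

def run_supervised(name, job, max_restarts=3):
    """Log the failure, clean up, restart; after repeated failures,
    flag the job and tell the user instead of retrying forever."""
    for attempt in range(1, max_restarts + 1):
        try:
            return job()
        except Exception:
            log.error("job %s failed (attempt %d):\n%s",
                      name, attempt, traceback.format_exc())
            # resource cleanup (close files/sockets, roll back) goes here
    disabled.add(name)
    log.error("job %s disabled after %d failures; notifying owner", name, max_restarts)
    return None
```

A healthy job returns its result on the first try; a persistently broken one ends up in `disabled` rather than looping.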

Why would I want to leave files hanging open, sockets connected, memory allocated, and database transactions inconsistent? Oh, wait, that doesn't happen because people wrote this OS that tries (unsuccessfully) to clean up after you when you fail.
Okay, if the OS can't clean up file handles, sockets, and memory when your process dies, I very much doubt you're going to have much luck doing so in some catch block. Similarly for your database with database transactions. Seriously, process death is about as safe a way as you can find to clean up from an undefined error.
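The point that process death is a safe cleanup path is easy to demonstrate: the kernel reclaims the dead child's files, sockets, and memory unconditionally, and the parent can distinguish a crash from a clean exit and react. A minimal sketch, assuming a POSIX system:

```python
import os
import signal

# Fork a child that opens a resource and then "crashes"; the OS reclaims
# everything it held, and the parent sees exactly how it died.
pid = os.fork()
if pid == 0:
    fh = open("/dev/null")                 # resource the OS will reclaim
    os.kill(os.getpid(), signal.SIGSEGV)   # simulate the crash
    os._exit(0)                            # never reached
_, status = os.waitpid(pid, 0)
crashed = os.WIFSIGNALED(status)
sig = os.WTERMSIG(status) if crashed else None
```

Whether the crash also leaves a core file depends on the `RLIMIT_CORE` limit; either way, the cleanup already happened.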
This violates the "fail fast" principle. After all, if some programmer in some unknown place did write code that divides by zero, how is your exception handler to know how to fix it?

It doesn't violate the "fail fast" principle any more than the OS closing open files and freeing your memory violates the "fail fast" principle. You're just saying "rely on the OS to be properly coded to do this for you" instead of "rely on your interpreter/compiler to be properly coded to do this for you."
Yes, because your interpreter/compiler is inherently having to trust all kinds of other code beyond the OS to be implemented correctly. Most of that code doesn't have a good way to work from the "I don't trust any of this stuff" perspective, nor has it been tested in that regard nearly as much.
I'm sorry, if you have hundreds of nodes, you have to *expect* hardware failures on a regular basis.
Right. But I don't need to do it at the same level of abstraction. If the whole rack catches on fire, dumping core or logging errors becomes irrelevant, and the next level of monitoring kicks in.
Sure, but there are far more subtle errors that are pretty much only going to show up in the form of your code finding itself in an unexpected state.
There are some errors you don't bother trying to recover from automatically. (Altho that last one I actually wrote a script to recover from.)
I couldn't agree more. Unexpected errors are very dangerous to try to recover from, particularly because by their very nature you can't be sure whether you really have a recoverable problem or not. Even exiting a process is no guarantee, but it is probably the best you can do.
or the programmer does something unintended...

Right. Like there's never been a time when people wrote faulty code that a range-checking language (as an example) would have caught at runtime.
Why do you need a range-checking language? How about a range-checking range? Is there some magic that comes from having a range-checking language that calls down to code in a non-range-checking language, versus using a range-checked iterator that calls down to code written in a non-range-checked language?
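The range-checked-iterator-over-unchecked-code idea can be sketched concretely. Here `ctypes` pointer indexing stands in for the "non-range-checked language" layer (it performs no bounds checking at all), and the `CheckedRange` wrapper is a hypothetical library type, not any real API; the point is that the check lives in library code rather than in the language.

```python
import ctypes

raw = (ctypes.c_int * 4)(10, 20, 30, 40)

def unchecked_get(buf, i):
    # ctypes pointer indexing does no bounds checking: reading past the
    # end returns whatever bytes happen to sit there in memory.
    return ctypes.cast(buf, ctypes.POINTER(ctypes.c_int))[i]

class CheckedRange:
    """Range-checked view over an unchecked buffer."""
    def __init__(self, buf, length):
        self.buf, self.length = buf, length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(f"index {i} out of range [0, {self.length})")
        return unchecked_get(self.buf, i)
```

`CheckedRange(raw, 4)[2]` delegates to the unchecked layer only after validating the index; `[4]` raises `IndexError` instead of reading garbage.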
Either that or the programmer is incompetent and is just as likely to screw up the error handling.

You have to handle the error *some* way. I'm not sure why handling inside your code the errors that are known not to screw up your language semantics isn't just as good as handling it in some other piece of code.
Because I've been burned far too often by people doing exactly this kind of well-intentioned coding. Their stack has been corrupted and they don't realize it, so instead of simply failing they try to "recover" from the problem; but because the stack is corrupted, the "recovery" that was supposed to just clean things up ends up setting someone's account balance to zero, or makes the system that provably can't deadlock deadlock, and so on.
And no, while it's difficult to write good error handling, once you have the error handling in place, chances are it's covering a lot of code.
I didn't say that good error handling can't be done, merely that if your expectation is that you are recovering from a logical error, I'd sure like the recovery code to be the product of some other development process.
Once you have transactional rollback in your database, it too covers all kinds of errors, including your application dumping core.
Yes, because the database has "client connection died" as one of its *expected* errors.
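That "client connection died" case is easy to see with SQLite, whose transaction semantics match the larger databases on this point. A sketch: the unexpected client death is simulated by closing the connection with a transaction still open, and the database rolls the transaction back as a matter of course.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
conn.commit()

# Open a transaction, then "die" without committing.  To the database
# this is just the expected client-went-away case: the pending
# transaction is rolled back, not half-applied.
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.close()   # stands in for the client process dumping core

conn2 = sqlite3.connect(path)
rows = conn2.execute("SELECT * FROM accounts").fetchall()
```

The table (committed) survives; the uncommitted insert does not.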

--Chris

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg