Darren New wrote:
> Christopher Smith wrote:
>> If you are talking about programming errors, then recovery is the
>> wrong kind of behavior. At most you want the code to checkpoint and
>> then core dump, preferably in some suitably annoying way that causes
>> someone to notice.
> Errr, no, not really. If someone puts a CGI script on my shared web
> server that has a bug in it, I don't really want to coredump the web
> server and take down my whole business, just so I notice.
No, but you'd probably want that CGI script's process to die and core
dump, while the web server goes about its business. Maybe you worry
about running out of disk space for cores, but fortunately you've set
reasonable sysctl parameters so that you don't worry so much about that.
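The per-process side of those knobs can be sketched with Python's standard `resource` module (Unix-only); the 64 MiB cap here is an illustrative value, not a recommendation:

```python
import resource

# Read the current core-file size limits for this process (in bytes).
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

# Cap core files at 64 MiB (or the hard limit, if that's lower) so a
# crashing CGI child can still dump core without filling the disk.
cap = 64 * 1024 * 1024
if hard != resource.RLIM_INFINITY:
    cap = min(cap, hard)
resource.setrlimit(resource.RLIMIT_CORE, (cap, hard))

print(resource.getrlimit(resource.RLIMIT_CORE)[0])  # the new soft limit
```

Children inherit the limit across fork/exec, which is why setting it once in the server process covers every CGI child it spawns.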
> Instead, I recover gracefully and send myself a page, which is
> suitably annoying enough to notice, thankyouverymuch.
This is exactly what I described as happening with our systems. Why does
everyone think dumping core is the end of the world?
>> When you have hundreds of nodes to manage, you don't want
>> programmer errors floating around the ether and code trying to
>> recover from it.
> "Recover from it" involves logging the error, cleaning up resources,
> and restarting. Or, if it's user-submitted code, setting a flag saying
> not to try to run that any more, telling the user, etc.
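A minimal sketch of that policy, with hypothetical job names and a made-up failure threshold; real supervisors do the same bookkeeping with more care:

```python
import logging
import traceback

logging.basicConfig(level=logging.ERROR)
MAX_FAILURES = 3  # illustrative threshold

def supervise(jobs, rounds=10):
    """Run each job every round; on error, log it and try again next
    round; after MAX_FAILURES, flag the job and stop running it."""
    failures = {name: 0 for name in jobs}
    for _ in range(rounds):
        for name, job in jobs.items():
            if failures[name] >= MAX_FAILURES:
                continue  # flagged: don't try to run that any more
            try:
                job()
            except Exception:
                failures[name] += 1
                logging.error("job %s failed:\n%s", name,
                              traceback.format_exc())
    return failures

# Hypothetical user-submitted jobs: one healthy, one buggy.
jobs = {"ok": lambda: None, "buggy": lambda: 1 / 0}
counts = supervise(jobs)
print(counts)  # the buggy job is retried only up to MAX_FAILURES
```

The cleanup step is elided here; in a real supervisor each retry would tear down and re-acquire the job's resources before running it again.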
> Why would I want to leave files hanging open, sockets connected,
> memory allocated, and database transactions inconsistent? Oh, wait,
> that doesn't happen because people wrote this OS that tries
> (unsuccessfully) to clean up after you when you fail.
Okay, if the OS can't clean up file handles, sockets, and memory when
your process dies, I very much doubt you're going to have much luck
doing so in some catch block. The same goes for your database and its
transactions. Seriously, process death is about as safe a way as you can
find to clean up from an undefined error.
>> This violates the "fail fast" principle. After all, if some
>> programmer in some unknown place did write code that divides by zero,
>> how is your exception handler to know how to fix it?
> It doesn't violate the "fail fast" principle any more than the OS
> closing open files and freeing your memory violates the "fail fast"
> principle. You're just saying "rely on the OS to be properly coded to
> do this for you" instead of "rely on your interpreter/compiler to be
> properly coded to do this for you."
Yes, because your interpreter/compiler inherently has to trust all
kinds of other code beyond the OS to be implemented correctly. Most of
that code doesn't have a good way to work from the "I don't trust any of
this stuff" perspective, nor has it been tested in that regard nearly as
much.
>> I'm sorry, if you have hundreds of nodes, you have to *expect*
>> hardware failures on a regular basis.
> Right. But I don't need to do it at the same level of abstraction. If
> the whole rack catches on fire, dumping core or logging errors becomes
> irrelevant, and the next level of monitoring kicks in.
Sure, but there are far more subtle errors that are pretty much only
going to show up in the form of your code finding itself in an
unexpected state.
> There are some errors you don't bother trying to recover from
> automatically. (Altho that last one I actually wrote a script to
> recover from.)
I couldn't agree more. Unexpected errors are very dangerous to try to
recover from, particularly because by their very nature you can't be
sure whether you really have a recoverable problem or not. Even exiting
a process is no guarantee, but it is probably the best you can do.
>> or the programmer does something unintended...
> Right. Like there's never been a time when people wrote faulty code
> that a range-checking language (as an example) would have caught at
> runtime.
Why do you need a range-checking language? How about a range-checking
library? Is there some magic that comes from having a range-checking
language that calls down to code in a non-range-checking language, vs.
using a range-checked iterator that calls down to code written in a
non-range-checked language?
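That point can be sketched in Python with a hypothetical unchecked buffer (simulating the silent wraparound you can get from unchecked reads) wrapped by a checking accessor; the range checking lives entirely in the library layer, not in the language underneath:

```python
class UncheckedBuffer:
    """Simulates an unchecked, C-style buffer: out-of-range reads
    silently return the wrong data instead of failing."""
    def __init__(self, data):
        self._data = list(data)

    def raw_read(self, i):
        return self._data[i % len(self._data)]  # silent wraparound

class CheckedView:
    """Range-checked accessor layered on top: same storage underneath,
    but a bad index fails loudly instead of returning garbage."""
    def __init__(self, buf, length):
        self._buf, self._len = buf, length

    def __getitem__(self, i):
        if not 0 <= i < self._len:
            raise IndexError(f"index {i} out of range [0, {self._len})")
        return self._buf.raw_read(i)

buf = UncheckedBuffer([10, 20, 30])
view = CheckedView(buf, 3)
print(view[2])            # in range: fine
try:
    view[3]               # off-by-one: caught by the library
except IndexError as e:
    print("caught:", e)
```

The names here are invented for illustration; the same layering is what checked iterators in C++ or bounds-checked wrappers over FFI buffers do in practice.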
>> Either that or the programmer is incompetent and is just as likely to
>> screw up the error handling.
> You have to handle the error *some* way. I'm not sure why handling
> inside your code the errors that are known not to screw up your
> language semantics isn't just as good as handling it in some other
> piece of code.
Because I've been burned by people doing exactly this kind of well
intentioned coding far too often. Their stack has been corrupted, and
they don't realize it, so instead of simply failing, they try to
"recover" from the problem, only their stack is corrupted, so their
"recovery" that is supposed to just clean things up ends up setting
someone's account balance to zero, or causes the system that provably
can't deadlock to deadlock, etc.
> And no, while it's difficult to write good error handling, once you
> have the error handling in place, chances are it's covering a lot of
> code.
I didn't say that good error handling can't be done, merely that if your
expectation is that you are recovering from a logical error, I'd sure
like the recovery code to be the product of some other development process.
> Once you have transactional rollback in your database, it too covers
> all kinds of errors, including your application dumping core.
Yes, because the database has "client connection died" as one of its
*expected* errors.
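What the database side of that looks like can be sketched with sqlite3; the exception here stands in for the client dying mid-transaction, which a real server handles the same way when it notices the connection drop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back if the block raises
        conn.execute("UPDATE accounts SET balance = 0 WHERE id = 1")
        raise RuntimeError("client died mid-transaction")
except RuntimeError:
    pass  # the application is gone; the database cleans up

balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # the half-done update was rolled back
```

The application never ran any recovery code of its own; the rollback is the database treating "the client went away" as an expected error, which is exactly the point above.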
--Chris
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg