Just my 2 centsŠ is the error code 1 the correct error code to return to the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1) may be called. It may make sense to either re-use that error code, or use a different one (if 1 is already used elsewhere for a different type of error, like "Invalid arguments" during start-up, for example).
If the error isn't an OOME, is there any clean-up ZK needs to do to maybe inform a cluster it's going down abruptly (maybe to gracefully begin a leader re-election if necessary, for example)? I'm +1 to fail-fast behavior. Ryan On 4/13/12 8:15 AM, "Scott Fines" <[email protected]> wrote: >On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM >for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd >args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't >remember off the top of my head). Will these triggers still be fired, or >will the catch-all prevent them? > >I'm still +1 for the change no matter what, but it's probably a good idea >to mention that in the docs if they don't work. > >Scott > >On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier ><[email protected]>wrote: > >> Hi everyone, >> >> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and >>I'd >> like some feedback from the user base on it. >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442 >> >> The current behavior of ZK when we get an uncaught exception is to log >>it >> and try to move on. This is arguably not the right thing to do, and will >> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for >> longer than it should. >> The patch proposes that when we get an instance of java.lang.Error, we >> should do a system.exit to fast-fail the process. With the possible >> exception of ThreadDeath (which may or may not be an unrecoverable >>system >> state depending on the thread), I think this makes sense, but I would >>like >> to hear from others if they have an opinion. I think it's better to kill >> the process and let your monitoring services detect process death (and >>thus >> restart) than possibly linger unresponsive for a while, are there >>scenarios >> that we're missing where this error can occur and you wouldn't want the >> process killed? >> >> Thanks for your feedback, >> >> Camille >>
