<background> I work on the ActiveMQ project, which implements the JMS API. JMS is a fairly complex thing, but it involves a number of objects (Connections, Sessions, Producers, Consumers). In some JMS providers it's the end user's responsibility to detect a connection failure (as distinct from any other kind of error) and then recreate all the dependent objects themselves.
We added support for auto-reconnection, which greatly simplifies the developer's life: it lets the JMS client deal with any socket failures automatically, reconnecting to a broker for you and re-establishing all of the in-flight operations (subscriptions, in-progress sends and so forth).

http://activemq.apache.org/how-can-i-support-auto-reconnection.html

Having seen the value of wrapping up auto-reconnection within a proxy, I'm thinking it also has merits for ZK. </background>

As we start creating protocols/recipes that implement higher-order features like locks, leader elections and so forth, we could probably do with some kind of auto-reconnecting facade to ZooKeeper, just to simplify the implementation code of protocols/recipes. It's a complex area and I'm sure different protocols will want different things, but even for something as simple as a lock I can see the value in an auto-reconnecting proxy. e.g. there are already 5 different method calls in the current WriteLock implementation which all really need a custom try/catch around them to detect loss of the connection, which should then be wrapped in reconnect-retry logic.

What to do about watches is interesting. For now the current behaviour seems fine (fire them all, forcing a re-watch), though in the future we could offer an option to re-enable watches on the new server connection.

All I'm thinking about for now is a kind of ReconnectingZooKeeper which looks like a ZooKeeper object but which internally catches dead connections and then tries to reconnect to one of the ZK servers under the covers, retrying the current read/write operation until the ReconnectPolicy says to fail. e.g. some folks might want to retry connecting forever; others for a certain amount of time or a certain number of attempts etc.

So something like...

    public class ReconnectingZooKeeper extends ZooKeeper {
        ...
        // for each method that reads/writes synchronously
        public Stat exists(String path) {
            ...
            // loop until the call succeeds or the policy tells us to give up
            for (int count = 0; ; count++) {
                try {
                    // really do the method call!
                    return super.exists(path);
                } catch (ConnectionClosedException e) {
                    // let any watches or listeners respond to connection loss first before we retry
                    fireAnyWatchesAndStuff();
                    if (!shouldRetry(count)) {
                        throw e;
                    }
                }
            }
        }
    }

Any watches should fire when a connection is lost, and all writes should be replicated to the new server we connect to, right?

So I'm thinking, if we had a ReconnectingZooKeeper implementation, we could use it with the current WriteLock implementation so that the protocol could survive ZK server loss & reconnection while still working. e.g. on connection loss the leader/lock owner needs to give up the lock until it gets it back, just in case; but other than that I think it should work.

I'm sure there are some gremlins somewhere in automatically reconnecting; though provided the watch mechanism works, clients will be able to do the right thing, I think.

Thoughts?

--
James
-------
http://macstrac.blogspot.com/

Open Source Integration
http://open.iona.com
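
P.S. To make the ReconnectPolicy idea a bit more concrete, here is a rough sketch of what the policy and the retry loop might look like on their own, without any ZooKeeper dependency. Every name in it (ReconnectSketch, ReconnectPolicy, ConnectionLostException, callWithRetry and the three factory methods) is made up for illustration; none of them exist in ZooKeeper today, and a real version would need to catch whatever exception the client actually throws on connection loss and fire watches before retrying.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch only: none of these types exist in ZooKeeper.
final class ReconnectSketch {

    /** Decides whether a failed operation should be attempted again. */
    interface ReconnectPolicy {
        // failures: how many attempts have failed so far (1 after the first failure)
        // elapsedMillis: time since the first attempt started
        boolean shouldRetry(int failures, long elapsedMillis);
    }

    /** Keep retrying forever. */
    static ReconnectPolicy forever() {
        return (failures, elapsedMillis) -> true;
    }

    /** Allow at most maxAttempts calls in total. */
    static ReconnectPolicy maxAttempts(int maxAttempts) {
        return (failures, elapsedMillis) -> failures < maxAttempts;
    }

    /** Keep retrying until the given amount of time has passed. */
    static ReconnectPolicy maxElapsedMillis(long limitMillis) {
        return (failures, elapsedMillis) -> elapsedMillis < limitMillis;
    }

    /** Stand-in for whatever exception signals a dead connection. */
    static final class ConnectionLostException extends Exception {
        ConnectionLostException(String message) {
            super(message);
        }
    }

    /** Runs the operation, retrying on connection loss until the policy says stop. */
    static <T> T callWithRetry(Callable<T> operation, ReconnectPolicy policy) throws Exception {
        long start = System.nanoTime();
        for (int failures = 1; ; failures++) {
            try {
                return operation.call();
            } catch (ConnectionLostException e) {
                // A real facade would fire watches and reconnect here before retrying.
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
                if (!policy.shouldRetry(failures, elapsedMillis)) {
                    throw e;
                }
            }
        }
    }
}
```

Each synchronous method on the facade could then delegate to something like callWithRetry, so the try/catch lives in one place and the choice between "retry forever", "N attempts" and "for T milliseconds" becomes a constructor argument. The sketch deliberately leaves out backoff between attempts and anything to do with watches.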