[ https://issues.apache.org/jira/browse/ZOOKEEPER-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632811#comment-15632811 ]
Benjamin Reed commented on ZOOKEEPER-2619:
------------------------------------------

multi will handle some of the use cases, but a simple one that it doesn't
handle is implementing swap:

{code}
zk.getData(znode, ...)
zk.setData(znode, ...)
{code}

You can't do that with multi (and I don't think we should extend multi to do
it :). multi also doesn't handle the case where you are updating lots of data
and would go over the max packet size.

> Client library reconnecting breaks FIFO client order
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-2619
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2619
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Diego Ongaro
>
> According to the USENIX ATC 2010
> [paper|https://www.usenix.org/conference/usenix-atc-10/zookeeper-wait-free-coordination-internet-scale-systems],
> ZooKeeper provides "FIFO client order: all requests from a given client are
> executed in the order that they were sent by the client." I believe
> applications written using the Java client library are unable to rely on this
> guarantee, and any current application that does so is broken. Other client
> libraries are also likely to be affected.
>
> Consider this application, which is simplified from the algorithm described
> on Page 4 (right column) of the paper:
> {code}
> zk = new ZooKeeper(...)
> zk.createAsync("/data-23857", "...", callback)
> zk.createSync("/pointer", "/data-23857")
> {code}
> Assume an empty ZooKeeper database to begin with and no other writers.
> Applying the above definition, if the ZooKeeper database contains /pointer,
> it must also contain /data-23857.
>
> Now consider this series of unfortunate events:
> {code}
> zk = new ZooKeeper(...)
> // The library establishes a TCP connection.
> zk.createAsync("/data-23857", "...", callback)
> // The library/kernel closes the TCP connection because it times out, and
> // the create of /data-23857 is doomed to fail with ConnectionLoss. Suppose
> // that it never reaches the server.
> // The library establishes a new TCP connection.
> zk.createSync("/pointer", "/data-23857")
> // The create of /pointer succeeds.
> {code}
> That's the problem: subsequent operations get assigned to the new connection
> and succeed, while earlier operations fail.
>
> In general, I believe it's impossible to have a system with the following
> three properties:
> # FIFO client order for asynchronous operations,
> # Failing operations when connections are lost, AND
> # Transparently reconnecting when connections are lost.
>
> To argue this, consider an application that issues a series of pipelined
> operations, then upon noticing a connection loss, issues a series of recovery
> operations, repeating the recovery procedure as necessary. If a pipelined
> operation fails, all subsequent operations in the pipeline must also fail.
> Yet the client must also carry on eventually: the recovery operations cannot
> be trivially failed forever. Unfortunately, the client library does not know
> where the pipelined operations end and the recovery operations begin. At the
> time of a connection loss, subsequent pipelined operations may or may not be
> queued in the library; others might be upcoming in the application thread. If
> the library re-establishes a connection too early, it will send pipelined
> operations out of FIFO client order.
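>
> For concreteness, here is roughly how the simplified snippet above might
> look against the actual Java client API (a sketch only; the connection
> string, session timeout, ACLs, and the printed diagnostics are illustrative
> assumptions, not part of the algorithm):
> {code}
> import org.apache.zookeeper.*;
> import org.apache.zookeeper.ZooDefs.Ids;
>
> public class PointerExample {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
>
>         // "createAsync": the asynchronous create returns immediately; the
>         // outcome arrives later in the callback, possibly as CONNECTIONLOSS.
>         zk.create("/data-23857", "...".getBytes(), Ids.OPEN_ACL_UNSAFE,
>                 CreateMode.PERSISTENT,
>                 (rc, path, ctx, name) -> System.out.println(
>                         "create " + path + ": " + KeeperException.Code.get(rc)),
>                 null);
>
>         // "createSync": blocks until the server responds. After a transparent
>         // reconnect, this can succeed even though the async create above
>         // failed with ConnectionLoss.
>         zk.create("/pointer", "/data-23857".getBytes(), Ids.OPEN_ACL_UNSAFE,
>                 CreateMode.PERSISTENT);
>         zk.close();
>     }
> }
> {code}
>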
> I considered a possible workaround of having the client diligently check its
> callbacks and watchers for connection loss events, and do its best to stop
> the subsequent pipelined operations at the first sign of a connection loss.
> In addition to being a large burden for the application, this does not solve
> the problem all the time. In particular, if the callback thread is delayed
> significantly (as can happen due to excessive computation or scheduling
> hiccups), the application may not learn about the connection loss event until
> after the connection has been re-established and after dependent pipelined
> operations have already been transmitted over the new connection.
>
> I suggest the following API changes to fix the problem:
> - Add a method ZooKeeper.getConnection() returning a ZKConnection object
> (see the sketch at the end of this message). ZKConnection would wrap a TCP
> connection. It would include all synchronous and asynchronous operations
> currently defined on the ZooKeeper class. Upon a connection loss on a
> ZKConnection, all subsequent operations on the same ZKConnection would
> return a ConnectionLoss error. Upon noticing the failure, the client would
> need to call ZooKeeper.getConnection() again to get a working ZKConnection
> object, and it would execute its recovery procedure on this new connection.
> - Deprecate all asynchronous methods on the ZooKeeper object. These are
> unsafe to use if the caller assumes they're getting FIFO client order.
> - No changes to the protocols or servers are required.
>
> I recognize this could cause a lot of code churn for both ZooKeeper and
> projects that use it. On the other hand, the existing asynchronous calls in
> applications should now be audited anyhow.
>
> The code affected by this issue may be difficult to contain:
> - It likely affects all ZooKeeper client libraries that provide both
> asynchronous operations and transparent reconnection. That's probably all
> versions of the official Java client library, as well as most other client
> libraries.
> - It affects all applications using those libraries that depend on the FIFO
> client order of asynchronous operations. I don't know how common that is,
> but the paper implies that FIFO client order is important.
> - Fortunately, the issue can only manifest when connections are lost and
> transparently re-established. In practice, it may also require a long
> pipeline or a significant delay in the application thread while the library
> establishes a new connection.
> - In case you're wondering, this issue occurred to me while working on a new
> client library for Go. I haven't seen it in the wild, but I was able to
> reproduce it locally by placing sleep statements in a Java program and
> closing its TCP connections.
>
> I'm new to this community, so I'm looking forward to the discussion. Let me
> know if I can clarify any of the above.
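>
> A minimal sketch of how the proposed API might be used (hypothetical code:
> ZKConnection, ZooKeeper.getConnection(), and the recovery routine below do
> not exist; they only illustrate the proposal):
> {code}
> ZKConnection conn = zk.getConnection();
> try {
>     // Pipelined operations, all bound to this one TCP connection.
>     conn.create("/data-23857", data, Ids.OPEN_ACL_UNSAFE,
>             CreateMode.PERSISTENT, callback, null);
>     conn.create("/pointer", "/data-23857".getBytes(), Ids.OPEN_ACL_UNSAFE,
>             CreateMode.PERSISTENT);
> } catch (KeeperException.ConnectionLossException e) {
>     // Every later operation on conn also fails with ConnectionLoss, so no
>     // pipelined operation can leak onto a new connection. Recovery runs on
>     // a fresh connection obtained explicitly:
>     ZKConnection fresh = zk.getConnection();
>     recover(fresh);  // hypothetical application-defined recovery procedure
> }
> {code}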