I've noticed a bug in my implementation of the durable ref, or "dref" construct. This is an attempt to add durability to the STM. drefs have all of the ACI guarantees of refs, with the additional property of being durable, so changes to the state of the identity are durable. The API is that there exists a logical namespace keyed by store name - a string - and a key - an object, which then corresponds to a value. Calling (dref "store" "key" initVal) creates a durable ref with the state of initVal under the store name "store" and key "key", which is immediately written to the underlying BDB JE database. Subsequent invocations of (dref) with the same store and key refer to the same logical object; initVal is ignored. This is also true if the subsequent (dref) call is in a later JVM instance; initVal is ignored and the durable value saved in the BDB JE database is used. (dref) is therefore a drop-in replacement for (ref).
This enhancement to the Clojure core is accessible here: git://github.com/kwanalyssa/clojure.git. There's some slight enhancements since I last ANN'd this: the rest of the Clojure primitives are now supported: BigDecimals and Ratios; and any object type acceptable as values are also acceptable as keys. This opens up the possibility of easily chaining key-value pairs, and having built in, transactional, durable memoization. I'm currently working on adding the persistent data structures such that they're also persistent on disk, minimizing I/O; and making the persistent structures lazily-loaded and evicted. This means potentially having petabyte-size maps occupying a fraction of that in memory and seamlessing swapping sections of the keyset and attached entryset based on usage. The possibilities for multidimensional databases (the use case I'm enhancing Clojure to serve), or even simply as massive, transactional, memoized caches, is awesome. But there is a bug. It has to do with creating/declaring durable refs inside of dosync blocks. To wit: (dosync (def a (dref "store" "key" 0)) (ref-set a (/ 1 0))). This block will obviously fail (i.e. throw an exception). What then, is the var a referring to? It must refer to the dref that was declared/created. But what is the state of that dref identity? In the non-durable version: (dosync (def a (ref 0)) (ref-set a (/ 1 0))), the var a points to a ref with a state of 0. That's because the ref was implicitly created in it's own single-operation transaction (like a DB autocommit) that precedes the transaction declared by the (dosync). It is possible because the ref has 2 properties: it is local, nothing outside the dosync block can concurrently refer to the ref - the var a is ThreadLocal; and the ref is obviously being created, it is not pointing to a ref that exists before the dosync block. However, the durable version doesn't have those two properties. Because the dref namespace is global, another transaction concurrently can be referring to that dref: either a single-operation create/ declare - (dref "store" "key" aDifferentInitVal), or the same create/ declare in a different dosync block. Also, the (dref) statement may refer to a dref that already exists in the store: the (dref) isn't creating something new, but refers to something already in the DB, in which case the initVal is ignored. This makes the transaction semantics much harder. In my current implementation, I punt on some of these issues. Imagine a scenario where there are 2 vars pointing to the same logical dref. var a is pointing to (dref "store" "key" initVal) and so is var b. dosync block A containing an (alter) on var a is executing concurrently as dosync block B, which contains an (alter) on var b. Although these may be two different POJOs, they have the same store and key, so they are the same logical object. Maintaining ACI guarantees means that acquiring a read lock on the identity referred to by var a must acquire the same lock on the identity object referred to by var b. The same is true with write locks. If dosync block A already has a write lock on the logical dref under both var a and var b, dosync block B must back off and retry after dosync block A either finishes or is cancelled/errored out for good. This is a property of having multiple POJOs potentially pointing to the same logical identity because of a global namespace. I punt on this issue by ensuring that all drefs declared with "store" "key" have the same POJO by using a weak cache. Because they have the same POJO, a lock on one (dref) object is a lock on all. But this leads to a problem. To illustrate, imagine 3 scenarios: In the first scenario, the dref doesn't exist in the DB. The (dref) statement is creating a new dref. There are two dosync blocks concurrently executing: dosync block A and dosync block B. dosync block A just happens to begin executing before dosync block B. However, the (dref) statement in dosync A just happens to execute after the (dref) statement in dosync block B. Because the dref statement in dosync block B is executed first in time, the dref is created using the initVal provided in the dosync block B call. However, the dosync block A began before dosync block B. So when dosync block A commits, it has values based on ensure-reads (including alters) on identities with states before dosync block B began. So shouldn't the dref have the initVal in dosync block A, not dosync block B, to be consistent? That's if both blocks succeed without retrying. In the second scenario, dosync block B fails. Then the dref should definitely have the initVal in dosync block A. In the third scenario, both blocks succeed, but dosync block B has to retry, so that it logically ends up executing after dosync block A. Well, the dref is already in cache with the initVal in dosync block B. Like in scenario 2, it should have the initVal in dosync block A. Besides those 3 scenarios, the following scenarios are unclear: What if the dref already exists in the DB, just not in the cache, so a POJO must be instantiated with the state from disk? If dosync block A begins before dosync block B, the dref in B executes first, and they both succeed, both vie for locks on the same POJO, since the (dref) statement in dosync block A should get the same POJO from cache and the blocks will vie for locks on the same object. If the order is the same but block B fails and block A succeeds, A gets all of the locks (potentially after a retry), and all is good. And if the order is the same but B retries first, it releases all of the locks and A gets the locks on that same dref POJO, so everything is fine. I'm finding this terribly hard to reason about, so are these all fine? Finally, what if there's a dref create (i.e. the dref doesn't already exist in the DB), and the dosync rolls back: (dosync (def a (dref "store" "key" 0)) (ref-set a (/ 1 0))). In this example, does the dref exist with the initVal of 0? Shouldn't it not exist so that a subsequent (dref) call will not have its initVal ignored? After all, a clean rollback should de-initialize the dref. And what if var a gets (deref)'ed before it's (ref-set) in some subsequent dosync block? If the dref should indeed not exist, shouldn't the (deref) throw an error? I'll admit that reasoning about concurrency isn't my strong suit, and I have no idea how to write good tests for this. How do we tackle this problem? -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en