The current Jackrabbit clustering doesn't scale well for writes because all cluster nodes use the same persistent storage. Even if the persistent storage itself is clustered, the cluster journal relies on changes being immediately visible in all nodes. That means Jackrabbit clustering can scale well for reads, but not for writes. This is a property Jackrabbit clustering shares with most clustering solutions for relational databases. Still, it would make sense to solve this problem for Jackrabbit 3.
== Current Jackrabbit Clustering ==

[Cluster Node 1] <--> | Shared
[Cluster Node 2] <--> | Storage

I propose a different architecture in Jackrabbit 3:

== Jackrabbit 3 Clustering ==

[Cluster Node 1] <--> [ Local Storage ]
[Cluster Node 2] <--> [ Local Storage ]

Please note that shared storage is still supported for things like the data store, but it is no longer required or supported for the persistent storage (currently called PersistenceManager). Instead, the cluster nodes should merge each other's changes asynchronously (except for operations like JCR locking, plus potentially other operations that are not that common; maybe even node move). By "asynchronously" I mean usually within a second or so, but potentially minutes later depending on configuration, latency between cluster nodes, and possibly load. This is similar to NoSQL systems.

== Unique Change Set Ids ==

For my idea to work, we need globally unique change set ids. Each change set is stored in the event journal, and can be retrieved later and sent to other cluster nodes. I suggest that events are grouped into change sets so that all events within the same session.save() operation have the same change set id. We could also call it a transaction id (I don't mind). Change set ids need to be unique across all cluster nodes. That means the change set id could be:

changeSetId = nanosecondsSince1970 * totalClusterNodes + clusterNodeId

Let's say you currently have 2 cluster nodes and expect to add a few more later (up to 10); then you could use the formula:

changeSetId = nanosecondsSince1970 * 10 + clusterNodeId

To support more than 10 cluster nodes the formula would need to be changed (that could be done at runtime). It doesn't necessarily need to be this exact formula, but the change set id should represent the time when the change occurred, and it should be unique.

== How to Merge Changes ==

Changes need to be merged so that all cluster nodes end up with the same data (you could call this "eventually consistent").
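A generator for such ids could be sketched roughly like this (illustrative class and method names, not actual Jackrabbit API; note that nanosecondsSince1970 multiplied by 10 no longer fits into a signed 64-bit long, so this sketch uses milliseconds instead):

```java
// Sketch only: generates globally unique, roughly time-ordered change set
// ids following the formula time * maxClusterNodes + clusterNodeId.
public class ChangeSetIdGenerator {

    private final long maxClusterNodes; // e.g. 10
    private final long clusterNodeId;   // 0 .. maxClusterNodes - 1
    private long lastId;

    public ChangeSetIdGenerator(long maxClusterNodes, long clusterNodeId) {
        this.maxClusterNodes = maxClusterNodes;
        this.clusterNodeId = clusterNodeId;
    }

    /** Returns the next globally unique change set id. */
    public synchronized long nextChangeSetId() {
        long id = System.currentTimeMillis() * maxClusterNodes + clusterNodeId;
        // keep ids strictly increasing on this node, even if the clock
        // did not advance between two calls (this preserves the node id,
        // because we add a full multiple of maxClusterNodes)
        if (id <= lastId) {
            id = lastId + maxClusterNodes;
        }
        lastId = id;
        return id;
    }

    /** Recovers the originating cluster node from a change set id. */
    public static long clusterNodeOf(long id, long maxClusterNodes) {
        return id % maxClusterNodes;
    }
}
```

Because each node contributes a distinct remainder modulo maxClusterNodes, two nodes can never produce the same id, and the quotient still encodes the time of the change.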
New changes are not problematic and can be applied directly. This includes local changes, of course, because the change set id of a local change is always newer than that of the last change. Changes with change set ids in the future are delayed. Cluster nodes should have reasonably synchronized clocks (they don't need to be completely exact, but they should be accurate enough that such delayed events are not that common). So the only tricky case is a change that happened in the past in another cluster node, if the same data was changed in this cluster node (or another cluster node) afterwards ("afterwards" meaning with a higher change set id). To find out that a change happened in the past, each node needs to at least know the change set id of the last change. There are multiple solutions:

== Solution A: Node Granularity, Ignore Old Changes ==

Here, each node only needs to know when it was changed the last time. If the incoming change set id is older than that, changes to its properties and child node list are ignored. That means, if two cluster nodes concurrently change data in a node, the newer change wins, and the older change is lost. This is a bit problematic, for example when concurrently adding child nodes: only the added child node of the newer change survives, which is probably unexpected.

== Solution B: Merge Old Changes ==

Here, we need an efficient way to load the list of changes (events) to a node since a certain time. When merging a change, the old versions of the node need to be loaded or re-constructed, and then the old change needs to be applied as if it had already happened before the newer change.
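Solution A above could be sketched roughly like this (illustrative names, not Jackrabbit API): per node, only the change set id of the last applied change is kept, and an incoming change that is older is silently dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Solution A: last-writer-wins at node granularity.
public class LastWriteWinsStore {

    // node id -> change set id of the last change applied to that node
    private final Map<String, Long> lastChanged = new HashMap<>();

    /**
     * Tries to apply a change to the given node. Returns false if the
     * change is ignored because a newer change to the same node is
     * already known (the newer change wins, the older one is lost).
     */
    public synchronized boolean apply(String nodeId, long changeSetId) {
        Long last = lastChanged.get(nodeId);
        if (last != null && changeSetId < last) {
            return false;
        }
        lastChanged.put(nodeId, changeSetId);
        // ...here the properties and child node list of the change
        // would actually be written to the local storage...
        return true;
    }
}
```

This makes the lost-update problem described above explicit: the older of two concurrent changes is simply never applied.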
Let's say we know about these two versions:

v1: node a; child nodes b, c, d; properties x=1, y=2
event t9: add child node e, set property x=2, remove property y
v9: node a; child nodes b, c, d, e; properties x=2

The change to merge happened in the past:

event t3: add child node f, remove child node b, set property y=3, remove property x, set property z=1

Now the result would be:

v9(new): node a; child nodes c, d, e, f; properties x=2, z=1

There are other ways to merge the changes of course (for example, only merge added / removed child nodes). I think there are some tricky problems, however I think it's relatively easy to ensure the algorithm is correct using a few randomized test cases. No matter what the merge rules are, they would need to be constructed so that, at the end of the day, each cluster node ends up with the exact same data, for all possible combinations and orders of changes.

Regards,
Thomas
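One possible way to implement this kind of merge is to replay all known events in change set id order, starting from the last common version (a sketch with made-up names, not Jackrabbit API; every cluster node that sees the same set of events then converges to the same state, regardless of arrival order):

```java
import java.util.*;

// Sketch of Solution B: merge by replaying events in change set id order.
public class ReplayMerge {

    /** Simplified node state: a child node list and integer properties. */
    public static class NodeState {
        public final Set<String> childNodes = new TreeSet<>();
        public final Map<String, Integer> properties = new TreeMap<>();
    }

    /** A single operation inside a change set. */
    public interface Op {
        void applyTo(NodeState s);
    }

    public static Op addChild(String name)        { return s -> s.childNodes.add(name); }
    public static Op removeChild(String name)     { return s -> s.childNodes.remove(name); }
    public static Op setProperty(String k, int v) { return s -> s.properties.put(k, v); }
    public static Op removeProperty(String k)     { return s -> s.properties.remove(k); }

    /**
     * Replays the given change sets (keyed by change set id, so the
     * SortedMap iterates them oldest first) onto a copy of the base state.
     */
    public static NodeState replay(NodeState base, SortedMap<Long, List<Op>> events) {
        NodeState s = new NodeState();
        s.childNodes.addAll(base.childNodes);
        s.properties.putAll(base.properties);
        for (List<Op> changeSet : events.values()) {
            for (Op op : changeSet) {
                op.applyTo(s);
            }
        }
        return s;
    }
}
```

Replaying the example's events t3 and t9 on top of v1 in that order yields child nodes c, d, e, f and properties x=2, z=1, i.e. v9(new) from the example.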