The current Jackrabbit clustering doesn't scale well for writes
because all cluster nodes use the same persistent storage. Even if
the persistent storage itself is clustered, the cluster journal relies
on changes being immediately visible on all nodes. That means
Jackrabbit clustering can scale well for reads, but not for writes.
This is a property Jackrabbit clustering shares with most
clustering solutions for relational databases. Still, it would make
sense to solve this problem for Jackrabbit 3.

== Current Jackrabbit Clustering ==

[Cluster Node 1]  <--> | Shared
[Cluster Node 2]  <--> | Storage

I propose a different architecture in Jackrabbit 3:

== Jackrabbit 3 Clustering ==

[Cluster Node 1]  <-->  [ Local Storage ]
[Cluster Node 2]  <-->  [ Local Storage ]

Please note that storage shared between cluster nodes is still
supported for things like the data store, but it is no longer required
or supported for the persistent storage (currently called
PersistenceManager).

Instead, the cluster nodes should merge each other's changes
asynchronously (except for operations like JCR locking, plus
potentially other operations that are not that common; maybe even node
move). By "asynchronously" I mean usually within a second or so, but
potentially minutes later, depending on configuration, latency between
cluster nodes, and possibly load. This is similar to NoSQL systems.

== Unique Change Set Ids ==

For my idea to work, we need globally unique change set ids. Each
change set is stored in the event journal, and can be retrieved later
and sent to other cluster nodes. I suggest that events are grouped
into change sets so that all events within the same session.save()
operation have the same change set id. We could also call it a
transaction id (I don't mind). Change set ids need to be unique across
all cluster nodes. That means the change set id could be:

changeSetId = nanosecondsSince1970 * totalClusterNodes + clusterNodeId

Let's say you currently have 2 cluster nodes and expect to add a few
more later (up to 10); then you could use the formula:

changeSetId = nanosecondsSince1970 * 10 + clusterNodeId

To support more than 10 cluster nodes the formula would need to be
changed (that could be done at runtime). It doesn't necessarily need
to be this formula, but the change set id should represent the time
when the change occurred, and it should be unique.
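
As an illustration, here is a minimal sketch of such a generator in
Java. The class and field names are hypothetical, and it uses
microseconds instead of nanoseconds so that the id still fits into a
signed 64-bit long:

import java.time.Instant;

// Hypothetical sketch of a change set id generator following the
// formula above: timeSince1970 * maxClusterNodes + clusterNodeId.
public class ChangeSetIdGenerator {

    private final long maxClusterNodes; // e.g. 10
    private final long clusterNodeId;   // 0 .. maxClusterNodes - 1
    private long lastId;

    public ChangeSetIdGenerator(long maxClusterNodes, long clusterNodeId) {
        this.maxClusterNodes = maxClusterNodes;
        this.clusterNodeId = clusterNodeId;
    }

    public synchronized long nextChangeSetId() {
        Instant now = Instant.now();
        long microsSince1970 = now.getEpochSecond() * 1_000_000L
                + now.getNano() / 1_000;
        long id = microsSince1970 * maxClusterNodes + clusterNodeId;
        // keep ids strictly increasing within this cluster node, even
        // if two session.save() calls fall into the same clock tick
        if (id <= lastId) {
            id = lastId + maxClusterNodes;
        }
        lastId = id;
        return id;
    }
}

Adding maxClusterNodes when resolving a clock collision keeps the
clusterNodeId part of the id intact, so ids stay unique across the
cluster.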

== How to Merge Changes ==

Changes need to be merged so that all cluster nodes end up with the
same data (you could call this "eventually consistent").

New changes are not problematic and can be applied directly. This
includes local changes of course, because the change set id of a local
change is always newer than that of the last applied change.

Changes with change set ids in the future are delayed. Cluster nodes
should have reasonably synchronized clocks (they don't need to be
completely exact, but they should be accurate enough that such delayed
change sets are not that common).
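
As a sketch of that dispatch (the ChangeSet type and the helper
methods below are assumptions, not existing Jackrabbit API), an
incoming change set could be routed like this:

// Hypothetical sketch: deciding what to do with a change set that was
// received from another cluster node.
interface ChangeSet {
    long getId();
}

public class ChangeSetDispatcher {

    private final ChangeSetIdGenerator idGenerator;

    public ChangeSetDispatcher(ChangeSetIdGenerator idGenerator) {
        this.idGenerator = idGenerator;
    }

    public void onRemoteChangeSet(ChangeSet changeSet) {
        // a fresh local id represents "now" on this cluster node
        long localNow = idGenerator.nextChangeSetId();
        if (changeSet.getId() > localNow) {
            // the id is in the future: delay until the local clock
            // has caught up, then try again
            delay(changeSet);
        } else {
            // apply directly; if the same data was changed locally
            // with a higher change set id, the merge handling
            // described below is needed
            apply(changeSet);
        }
    }

    private void delay(ChangeSet changeSet) {
        // e.g. re-queue the change set until the local clock reaches its id
    }

    private void apply(ChangeSet changeSet) {
        // apply the events of the change set to the local storage
    }
}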

So the only tricky case is changes that happened in the past on
another cluster node, while the same data was changed on this cluster
node (or another cluster node) afterwards ("afterwards" meaning with a
higher change set id). To detect that a change happened in the past,
each node needs to know at least the change set id of its last change.
There are multiple solutions:

== Solution A: Node Granularity, Ignore Old Changes ==

Here, each node only needs to know when it was last changed. If the
change set id of an incoming change is older than that, changes to the
node's properties and child node list are ignored. That means, if two
cluster nodes concurrently change data in a node, the newer change
wins, and the older change is lost. This is a bit problematic, for
example when concurrently adding child nodes: only the child node
added by the newer change survives, which is probably unexpected.
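
A simplified sketch of this rule (the NodeState and NodeChange types
below are illustrations, not Jackrabbit API):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified sketch of Solution A: a change to a node is
// ignored completely if the node was modified later (higher change set id).
public class NodeGranularityMerge {

    // simplified node state: last change set id, properties, child names
    public static class NodeState {
        long lastChangeSetId;
        final Map<String, String> properties = new HashMap<>();
        final Set<String> childNodes = new HashSet<>();
    }

    // simplified change to a single node, as recorded in the event journal
    public static class NodeChange {
        long changeSetId;
        final Map<String, String> setProperties = new HashMap<>();
        final Set<String> removedProperties = new HashSet<>();
        final Set<String> addedChildNodes = new HashSet<>();
        final Set<String> removedChildNodes = new HashSet<>();
    }

    public void apply(NodeState node, NodeChange change) {
        if (change.changeSetId < node.lastChangeSetId) {
            // older concurrent change: the newer change wins and the
            // older one is lost, including any added child nodes
            return;
        }
        node.properties.putAll(change.setProperties);
        node.properties.keySet().removeAll(change.removedProperties);
        node.childNodes.addAll(change.addedChildNodes);
        node.childNodes.removeAll(change.removedChildNodes);
        node.lastChangeSetId = change.changeSetId;
    }
}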

== Solution B: Merge Old Changes ==

Here, we need an efficient way to load the list of changes (events)
made to a node since a certain time. When merging a change, the old
version of the node needs to be loaded or re-constructed, and then the
old change needs to be applied as if it had already happened before
the newer change. Let's say we know about the two versions:

v1: node a; child nodes b, c, d; properties x=1, y=2
event t9: add child node e, set property x=2, remove property y
v9: node a; child nodes b, c, d, e; properties x=2

The change to merge happened in the past:

event t3: add child node f, remove child node b, set property y=3,
remove property x, set property z=1

Now the result would be:

v9(new): node a, child nodes c, d, e, f; properties x=2, z=1

There are other ways to merge the changes, of course (for example,
only merging added / removed child nodes). I think there are some
tricky problems, but it should be relatively easy to verify that the
algorithm is correct using a few randomized test cases. No matter what
the merge rules are, they need to be constructed so that, at the end
of the day, each cluster node ends up with exactly the same data, for
all possible combinations and orders of changes.
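
One way to get deterministic merge rules is to replay all known
changes to the node in change set id order, including the late one.
The sketch below does this, reusing the simplified NodeState /
NodeChange types from the Solution A sketch (same package assumed);
loading the base version and the newer changes from the event journal
is assumed to be available:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of Solution B: rebuild the node by replaying all
// known changes in change set id order, so every cluster node ends up
// with the same data regardless of the order in which changes arrive.
public class MergeByReplay {

    private final NodeGranularityMerge applier = new NodeGranularityMerge();

    public NodeGranularityMerge.NodeState merge(
            NodeGranularityMerge.NodeState baseVersion,          // e.g. v1
            List<NodeGranularityMerge.NodeChange> newerChanges,  // e.g. t9
            NodeGranularityMerge.NodeChange lateChange) {        // e.g. t3
        List<NodeGranularityMerge.NodeChange> all =
                new ArrayList<>(newerChanges);
        all.add(lateChange);
        // deterministic order: sort by change set id
        all.sort(Comparator.comparingLong(c -> c.changeSetId));
        for (NodeGranularityMerge.NodeChange change : all) {
            applier.apply(baseVersion, change);
        }
        return baseVersion;
    }
}

Replaying t3 and then t9 on v1 reproduces the v9(new) result shown
above.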

Regards,
Thomas
