[jr3] Node Identifiers / Corresponding Nodes

Thomas Müller Sun, 18 Apr 2010 02:37:32 -0700

== Node Identifier Format ==

Jackrabbit node ids are currently UUIDs. For Jackrabbit 3, I think
that embedded storage mechanisms should use a long sequence instead.
Advantages of sequences: faster to generate (nodeId = nextId++);
faster index lookup (nodes generated at around the same time have
similar ids, which improves index efficiency); needs less space
(especially when using a variable size format; see [1]). Advantages of
UUIDs: they allow distributed creation of nodes. That's why the
Jackrabbit 3 data format should also support UUIDs as node ids, for
cloud storage mechanisms.


== JCR Node Identifier versus Internal Unique Node ID ==

The JCR API requires that corresponding nodes of different workspaces
have the same JCR identifier. The current Jackrabbit stores each
workspace separately, so that's not a problem. With Jackrabbit 3, I
would like to combine the storage of all workspaces. The problem is
that JCR node identifiers can then no longer be equal to the internal
unique node id. For efficient storage, the internal node id should be
the combination of the workspace id and the JCR node identifier.

One solution is: long internalUniqueNodeId = (workspaceId << 40) +
jcrNodeIdentifier. The problem is: node ids in workspaces other than
workspace #0 need quite a lot of space when using a variable size
format.
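
To put a number on that, here is a minimal sketch. It assumes a
7-bits-per-byte variable size encoding like the ones referenced in [1];
the class and method names are made up for illustration and are not
part of Jackrabbit.

```java
final class ShiftedIdSize {

    // Bytes needed to store a non-negative long in a 7-bits-per-byte
    // variable size format (an assumption; see [1] for real formats).
    static int varLongLength(long x) {
        int len = 1;
        while ((x >>>= 7) != 0) {
            len++;
        }
        return len;
    }

    // The simple scheme: workspace id in the upper bits.
    static long shiftedId(long workspaceId, long jcrNodeIdentifier) {
        return (workspaceId << 40) + jcrNodeIdentifier;
    }

    public static void main(String[] args) {
        // Node 5 in workspace #0 fits in 1 byte.
        System.out.println(varLongLength(shiftedId(0, 5))); // prints 1
        // The same node in workspace #1 needs 6 bytes.
        System.out.println(varLongLength(shiftedId(1, 5))); // prints 6
    }
}
```

So with this scheme every node id outside workspace #0 needs at least 6
bytes, no matter how small the JCR node identifier is.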

My proposal is: store the workspace id at the end of the JCR node
identifier, using a variable size format (see [1]). I think in most
cases there is only 1 workspace (workspace #0). The second most
important case is fewer than 16 workspaces. I suggest supporting the
following 4 cases:

* workspace #0: the node ids end with bit 0:
  internalUniqueNodeId = jcrNodeIdentifier << 1

* workspaces #1-#15: node ids end with the bits 01:
  (jcrNodeIdentifier << 6) + (workspaceId << 2) + 1

* workspaces #16-#2047: node ids end with 011:
  (jcrNodeIdentifier << 14) + (workspaceId << 3) + 3

* workspaces #2048-#268'435'455: node ids end with 0111:
  (jcrNodeIdentifier << 32) + (workspaceId << 4) + 7

* workspaces #268'435'456 and larger are not supported.
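
The four cases above could be sketched in Java roughly as follows. This
is only an illustration of the bit layout: the class and method names
are hypothetical, and the sketch does not check whether the JCR node
identifier still fits into the remaining bits of the long.

```java
final class InternalNodeId {

    // Largest workspace id that fits into the 28 bits of the last case.
    static final long MAX_WORKSPACE_ID = (1L << 28) - 1; // 268'435'455

    static long encode(long jcrNodeIdentifier, long workspaceId) {
        if (workspaceId < 0 || workspaceId > MAX_WORKSPACE_ID) {
            throw new IllegalArgumentException(
                    "unsupported workspace id: " + workspaceId);
        }
        if (workspaceId == 0) {
            return jcrNodeIdentifier << 1;                             // ...0
        } else if (workspaceId < 16) {
            return (jcrNodeIdentifier << 6) + (workspaceId << 2) + 1;  // ...01
        } else if (workspaceId < 2048) {
            return (jcrNodeIdentifier << 14) + (workspaceId << 3) + 3; // ...011
        }
        return (jcrNodeIdentifier << 32) + (workspaceId << 4) + 7;     // ...0111
    }

    // Extract the workspace id by looking at the trailing marker bits.
    static long workspaceId(long id) {
        if ((id & 1) == 0) return 0;
        if ((id & 3) == 1) return (id >>> 2) & 0xF;
        if ((id & 7) == 3) return (id >>> 3) & 0x7FF;
        if ((id & 15) == 7) return (id >>> 4) & 0xFFFFFFF;
        throw new IllegalArgumentException("unknown id format: " + id);
    }

    // Extract the JCR node identifier (the upper bits).
    static long jcrNodeIdentifier(long id) {
        if ((id & 1) == 0) return id >>> 1;
        if ((id & 3) == 1) return id >>> 6;
        if ((id & 7) == 3) return id >>> 14;
        if ((id & 15) == 7) return id >>> 32;
        throw new IllegalArgumentException("unknown id format: " + id);
    }

    public static void main(String[] args) {
        long id = encode(5, 3);
        System.out.println(id);                    // prints 333
        System.out.println(workspaceId(id));       // prints 3
        System.out.println(jcrNodeIdentifier(id)); // prints 5
    }
}
```

For example, node 5 in workspace #3 becomes 333, which has 9 bits and so
fits in two bytes of a 7-bits-per-byte format, while (3 << 40) + 5 would
need six.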

What do you think, do those constants make sense?

[1] Variable size int / long formats are used in various open source
projects such as Apache Lucene, SQLite, H2 Database Engine, and Google
Protocol Buffers. They are somewhat similar to the UTF-8 encoding. See
also:
http://code.google.com/p/h2database/source/browse/trunk/h2/src/main/org/h2/store/Data.java#989
http://en.wikipedia.org/wiki/Golomb_coding
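
For readers who have not seen such a format, here is a minimal sketch of
one common variant: 7 data bits per byte, lowest bits first, with the
high bit of each byte meaning "more bytes follow". It is written in the
style of the formats listed in [1], but it is not copied from any of
those projects, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;

final class VarLong {

    // Write x using 7 data bits per byte, lowest bits first.
    // The high bit of a byte is set if more bytes follow.
    static void write(ByteBuffer buff, long x) {
        while ((x & ~0x7FL) != 0) {
            buff.put((byte) ((x & 0x7F) | 0x80));
            x >>>= 7;
        }
        buff.put((byte) x);
    }

    // Read a value written by write(). At most 10 bytes are consumed.
    static long read(ByteBuffer buff) {
        long x = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buff.get();
            x |= (long) (b & 0x7F) << shift;
            if (b >= 0) {
                // high bit not set: this was the last byte
                return x;
            }
        }
        throw new IllegalArgumentException("malformed variable size long");
    }

    public static void main(String[] args) {
        ByteBuffer buff = ByteBuffer.allocate(16);
        write(buff, 300);
        System.out.println(buff.position()); // prints 2
        buff.flip();
        System.out.println(read(buff));      // prints 300
    }
}
```

Small values take one byte and the size grows by one byte for every 7
additional bits, which is why small, sequential node ids are cheap to
store.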

== Node Without ID ==

The Jackrabbit 3 data format should support storing nodes embedded
within the parent node. The advantage is: such embedded nodes would be
stored next to each other, possibly improving read performance, and
maybe reducing storage space (both need to be tested). The identifier
of such embedded nodes would be unique, but not stable.

Regards,
Thomas
