== Node Identifier Format ==

Jackrabbit node ids are currently UUIDs. For Jackrabbit 3, I think that embedded storage mechanisms should use a long sequence instead.

Advantages of sequences:

* faster to generate (nodeId = nextId++)
* faster index lookup (nodes generated at around the same time have similar ids, which improves index efficiency)
* need less space (especially when using a variable size format; see [1])

Advantages of UUIDs:

* allow distributed creation of nodes

That's why the Jackrabbit 3 data format should still support UUIDs as node ids: for cloud storage mechanisms.
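To illustrate the space argument, here is a minimal sketch of a variable size long format using 7 data bits per byte, which is the general idea behind the formats mentioned in [1]. The class name VarLong and its methods are hypothetical and not part of Jackrabbit; the exact byte layout used by any particular project may differ.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper (not a Jackrabbit class): 7-bit-per-byte variable size long. */
final class VarLong {

    private VarLong() {
    }

    /** Encodes x with 7 data bits per byte; a set high bit means "more bytes follow". */
    static byte[] write(long x) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((x & ~0x7fL) != 0) {
            out.write((int) ((x & 0x7f) | 0x80));
            x >>>= 7;
        }
        out.write((int) x);
        return out.toByteArray();
    }

    /** Decodes a value produced by write(). */
    static long read(byte[] data) {
        long x = 0;
        int shift = 0;
        for (byte b : data) {
            x |= (long) (b & 0x7f) << shift;
            shift += 7;
        }
        return x;
    }
}
```

With this sketch, sequence ids below 128 take 1 byte and ids up to about two million take at most 3 bytes, whereas a UUID always needs 16 bytes in binary form.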
== JCR Node Identifier versus Internal Unique Node ID ==

The JCR API requires that corresponding nodes of different workspaces have the same JCR identifier. The current Jackrabbit stores each workspace separately, so that's not a problem. With Jackrabbit 3, I would like to combine the storage of all workspaces. The problem is that JCR node identifiers can then no longer be equal to the internal unique node id. For efficient storage, the internal node id should be the combination of the workspace id and the JCR node identifier.

One solution is: long internalUniqueNodeId = (workspaceId << 40) + jcrNodeIdentifier. The problem is: node ids in workspaces other than workspace #0 need quite a lot of space when using a variable size format.

My proposal is: store the workspace id at the end of the JCR node identifier, using a variable size format (see [1]). I think in most cases there is only 1 workspace (workspace #0). The second most important case is fewer than 16 workspaces. I suggest supporting the following 4 cases:

* workspace #0: node ids end with the bit 0: internalUniqueNodeId = jcrNodeIdentifier << 1
* workspaces #1-#15: node ids end with the bits 01: (jcrNodeIdentifier << 6) + (workspaceId << 2) + 1
* workspaces #16-#2047: node ids end with the bits 011: (jcrNodeIdentifier << 14) + (workspaceId << 3) + 3
* workspaces #2048-#268'435'455: node ids end with the bits 0111: (jcrNodeIdentifier << 32) + (workspaceId << 4) + 7

Workspaces #268'435'456 and larger are not supported.

What do you think, do those constants make sense?

[1] The variable size int / long formats are used in various open source projects such as Apache Lucene, SQLite, H2 Database Engine, Google Protocol Buffers. It is somewhat similar to UTF-8 encoding. See also:
http://code.google.com/p/h2database/source/browse/trunk/h2/src/main/org/h2/store/Data.java#989
http://en.wikipedia.org/wiki/Golomb_coding

== Node Without ID ==

The Jackrabbit 3 data format should support storing nodes embedded within the parent node.
The advantage is: such embedded nodes would be stored next to each other, possibly improving read performance, and maybe reducing storage space (both need to be tested). The identifier of such embedded nodes would be unique, but not stable.

Regards,
Thomas
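P.S. To make the four workspace cases from the "JCR Node Identifier versus Internal Unique Node ID" section easier to check, here is a minimal Java sketch of the encoding and the matching decoding. The class name NodeIdCodec and its method names are hypothetical. The sketch assumes that jcrNodeIdentifier is non-negative and small enough that the shift does not overflow a long; it does not check for that.

```java
/** Hypothetical sketch of the four-case workspace id suffix; all names are illustrative. */
final class NodeIdCodec {

    private static final int MAX_WORKSPACE_ID = 0xfffffff; // 268'435'455, 28 bits

    private NodeIdCodec() {
    }

    /** Combines a workspace id and a JCR node identifier into one internal id. */
    static long encode(int workspaceId, long jcrNodeIdentifier) {
        if (workspaceId < 0 || workspaceId > MAX_WORKSPACE_ID) {
            throw new IllegalArgumentException("unsupported workspace id: " + workspaceId);
        }
        if (workspaceId == 0) {
            return jcrNodeIdentifier << 1;                                     // suffix 0
        } else if (workspaceId < 16) {
            return (jcrNodeIdentifier << 6) + ((long) workspaceId << 2) + 1;   // suffix 01
        } else if (workspaceId < 2048) {
            return (jcrNodeIdentifier << 14) + ((long) workspaceId << 3) + 3;  // suffix 011
        } else {
            return (jcrNodeIdentifier << 32) + ((long) workspaceId << 4) + 7;  // suffix 0111
        }
    }

    /** Extracts the workspace id by inspecting the suffix bits. */
    static int workspaceOf(long internalId) {
        if ((internalId & 1) == 0) {
            return 0;
        } else if ((internalId & 3) == 1) {
            return (int) ((internalId >>> 2) & 0xf);
        } else if ((internalId & 7) == 3) {
            return (int) ((internalId >>> 3) & 0x7ff);
        } else if ((internalId & 15) == 7) {
            return (int) ((internalId >>> 4) & MAX_WORKSPACE_ID);
        }
        throw new IllegalArgumentException("unknown suffix: " + internalId);
    }

    /** Extracts the JCR node identifier by dropping suffix and workspace bits. */
    static long jcrIdOf(long internalId) {
        if ((internalId & 1) == 0) {
            return internalId >>> 1;
        } else if ((internalId & 3) == 1) {
            return internalId >>> 6;
        } else if ((internalId & 7) == 3) {
            return internalId >>> 14;
        } else if ((internalId & 15) == 7) {
            return internalId >>> 32;
        }
        throw new IllegalArgumentException("unknown suffix: " + internalId);
    }
}
```

For example, NodeIdCodec.encode(0, 5) gives 10 and NodeIdCodec.encode(3, 5) gives 333, so ids in workspace #0 cost only one extra bit, and ids in workspaces #1-#15 cost six.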
