hi jukka

very interesting! :) a couple of random questions follow inline...
On Jan 14, 2008 2:31 PM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
> With the recent NGP interest I wanted to push some of my latest
> prototype work to the jackrabbit-ngp sandbox. Perhaps the most notable
> (though not very fleshed out) concept is the simplified storage
> mechanism that I plan to try out. Here's a quick summary of how I see
> it working.
>
> The storage model is similar to the DataStore concept in
> jackrabbit-core. All content is stored in separate "records" that are
> basically just immutable blobs identified by their SHA-1 checksums.
>
> All nodes are serialized to a binary representation and stored as
> immutable records in the system. The SHA-1 record checksum is used as
> the internal node identifier instead of an explicitly assigned UUID. A
> parent node contains the names and SHA-1 record checksums of all the
> child nodes.

what about the properties?

>
> As an example, consider a simple content tree with four nodes: the
> root node, "foo", "bar", and "baz". The "bar" node is a child of
> "foo", and "foo" and "baz" are children of the root node. In path
> notation:
>
>     /
>     /foo
>     /foo/bar
>     /baz
>
> The "bar" and "baz" nodes are empty, and could be represented by an
> empty record, with SHA-1 checksum X. The "foo" node has "bar"
> (checksum X) as a child, so could have a binary representation like
> ["bar"=X], with checksum Y. The root node has "foo" (checksum Y) and
> "baz" (checksum X) as child nodes, and could be represented as
> ["foo"=Y,"baz"=X], with checksum Z. The repository would then contain
> the following three records and some metadata that marks record Z as
> the root node.
>
>     X: []
>     Y: ["bar"=X]
>     Z: ["foo"=Y,"baz"=X]
>     root => Z
>
> A revision that adds an empty "new" node at "/foo/new" would result
> in "foo" getting a new record ["bar"=X,"new"=X] (checksum P) and the
> root node becoming ["foo"=P,"baz"=X] (checksum Q). The repository
> would then be:
>
>     X: []
>     Y: ["bar"=X]
>     Z: ["foo"=Y,"baz"=X]
>     P: ["bar"=X,"new"=X]
>     Q: ["foo"=P,"baz"=X]
>     root => Q
>
> A session that was opened before this change could still continue
> accessing the repository with record Z as the root node until the
> session is either explicitly or implicitly refreshed to the latest
> state. Once all clients have stopped referring to Z as the root node,
> a garbage collector could reduce the repository to:
>
>     X: []
>     P: ["bar"=X,"new"=X]
>     Q: ["foo"=P,"baz"=X]
>     root => Q
>
> The only synchronization point in this scheme would be changing the
> root pointer to a more recent version of the root node. A client that
> wants to persist a new revision can store all the records included in
> the revision, perform any required consistency checks, and finally
> update the root pointer to the validated new root record. Almost all
> of this can be done in parallel with other clients; only when changing
> the root pointer does the client need to verify that nobody else has
> meanwhile updated the root pointer. If the root pointer has changed,
> the client needs to repeat any merging and validation steps before
> retrying the update. In typical scenarios such write conflicts should
> be relatively rare.
>
> There are some notable implications of such a storage model:
>
> Parent references are not stored anywhere, which means that for each
> accessed node all the ancestor nodes must also be accessed. This is a
> requirement in any case if we want to enforce hierarchical access
> controls or other policies.

how would you build the path of a node accessed by uuid?

cheers
stefan

>
> Explicit UUIDs are stored as literal jcr:uuid properties and REFERENCE
> properties are just specially typed string properties. Indexing is
> used to speed up getNodeByUUID() lookups, making getNodeByUUID()
> essentially equivalent to an XPath query like //*[@jcr:uuid='...'].
> Referential integrity is handled explicitly on a higher level. Because
> of this, hard references and direct UUID access will likely be worse
> than in current jackrabbit-core, but to me that's a conscious design
> tradeoff.
>
> To make queries work properly for clients that use any past version of
> the root node, search indexes should be stored as a part of the
> content tree instead of outside it. This way a content update will
> always include the respective index updates. To best reuse our current
> query engine, I would store the index files within a special
> /rep:index node. Lucene's segment file model should work well with
> immutable records.
>
> This storage model is quite simple to implement on the file system and
> there's also a trivial mapping to HTTP. In fact any web server that
> supports the GET and PUT methods and the ETag, If-Match, and
> If-None-Match headers should be directly usable as a backend for this
> storage model. Such record resources would also be trivially
> cacheable.
>
> BR,
>
> Jukka Zitting
>
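
For illustration only, the X/Y/Z/P/Q example quoted above can be played through in a few lines of Python. This is a rough sketch under assumptions, not anything from the jackrabbit-ngp sandbox: the RecordStore class, the add_child helper, and the JSON serialization are all hypothetical stand-ins (the proposal only says nodes are serialized to "a binary representation"), and an in-memory dict plays the role of the record storage.

```python
import hashlib
import json
import threading


class RecordStore:
    """Hypothetical in-memory stand-in for the immutable record store."""

    def __init__(self):
        self.records = {}   # SHA-1 hex digest -> serialized record bytes
        self.root = None    # digest of the current root record
        self._lock = threading.Lock()

    def put(self, children):
        """Serialize a node (child name -> child digest) and store it."""
        # Assumed serialization: JSON with sorted keys, so that equal
        # nodes always give equal bytes and therefore equal digests.
        data = json.dumps(children, sort_keys=True).encode("utf-8")
        digest = hashlib.sha1(data).hexdigest()
        self.records[digest] = data
        return digest

    def get(self, digest):
        """Load a node back as a dict of child name -> child digest."""
        return json.loads(self.records[digest].decode("utf-8"))

    def compare_and_set_root(self, expected, new):
        """The single synchronization point: swap the root if unchanged."""
        with self._lock:
            if self.root != expected:
                return False
            self.root = new
            return True


def add_child(store, node_digest, path, name, child_digest):
    """Add a child below `path`; return the digest of the new node copy."""
    children = store.get(node_digest)
    if path:
        # Rewrite the child on the path first, then point this node at it.
        head, rest = path[0], path[1:]
        children[head] = add_child(store, children[head], rest,
                                   name, child_digest)
    else:
        children[name] = child_digest
    return store.put(children)


store = RecordStore()
x = store.put({})                        # X: []
y = store.put({"bar": x})                # Y: ["bar"=X]
z = store.put({"foo": y, "baz": x})      # Z: ["foo"=Y,"baz"=X]
store.compare_and_set_root(None, z)      # root => Z

q = add_child(store, z, ["foo"], "new", x)   # adds /foo/new, creating P and Q
p = store.get(q)["foo"]
committed = store.compare_and_set_root(z, q)  # root => Q
stale = store.compare_and_set_root(z, q)      # a writer still based on Z fails
```

Only the nodes on the path from the changed node up to the root get new records; "bar" and "baz" keep sharing record X, and record Z stays readable for any session that still uses it as its root. A writer whose compare_and_set_root() call returns False would have to merge against the new root and retry, which is the conflict case described in the quoted text.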
