Hi,

While I agree that MVCC and clustering are important, I came to the conclusion that they do not require a content addressable storage.
>svn and git.

My implementation is modeled after relational and NoSQL databases. Databases are optimized for fine grained content (rows), which I believe matches quite well with what we do at the node level. An important exception is binary data, where we use a content addressable storage (the data store), which I think is appropriate.

>my assumption was that our clustering implementation could leverage the
>content-addressed model

While I agree the content hash can be used to efficiently sync remote (sub)trees, I believe it is not required to use the content hash as the node id. Instead, the content hash can be stored as a property, or as a *part* of the node id (not the only part of the node id). A rough sketch of what I mean is at the end of this mail (sketch 1).

>as long as we don't have a clear idea of how to support clustering i am
>rather reluctant to already give up on the content-addressable model.

I will not ask you to give up on your model, but please don't ask me to give up on mine :-)

My view for clustering is: I believe we should have a look at how other solutions work, especially NoSQL databases. So far I am not aware of a NoSQL database that uses a content addressable storage (unless you view git and svn as NoSQL databases). I believe we should build clustering on two mechanisms:

* A virtual repository to distribute (shard) the data. Please note the content hash will not help here (see sketch 2 at the end of this mail).
* For data that is stored in multiple repositories, a synchronization mechanism. This can be achieved using the journal, the content hash, or both.

Neither mechanism requires a content addressable storage.

>supporting flat hierarchies.
>since we can't assume that child node names do follow a specific
>pattern (e.g. n1-n99999999) i don't follow your performance-related
>reasoning.

If people care about performance, they will use patterns; if performance is important, patterns are required. After playing around with bloom filters, I came to the conclusion that there simply is no efficient way to index randomly distributed data on disk.

>i've considered storing the diff of a commit in the revision a while ago.
>while it would be relatively easy to implement i currently don't see an
>immediate benefit compared to diffing which OTOH is very efficient
>thanks to the content-addressable model.

While efficient diffing requires a content hash, it does not require a content addressable storage. The main advantage of storing the commit is that the implementation is simpler. Diffing has advantages and disadvantages: the advantage is that the journal doesn't need to be stored on disk; the disadvantages are that the reconstructed journal is not always identical to the original, and that the implementation is more complex.

>i imagine that in a clustering scenario there's a need to compute
>the changes between 2 non-consecutive revisions, potentially
>spanning a large number of intermediate revisions. just diffing
>2 revsisions is IMO probably more efficient than reconstructing
>the changes from a large number of revisions.

That is true. I think it would be an advantage to implement the diffing at a higher level, that is, above the MicroKernel API. That would also allow synchronizing repositories through the MicroKernel API.

Regards,
Thomas
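
P.S.: Below are two rough sketches of what I mean. They are only illustrations with made-up class and property names (NodeRecord, ":hash", VirtualRepository), not proposals for the actual code.

Sketch 1: the content hash is computed per node and stored in a normal ":hash" property; the node id stays a plain storage id. Diffing compares the ":hash" properties and skips unchanged subtrees, so we get efficient diffing and syncing without a content addressable storage:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class NodeRecord {

    final long id;                                   // plain storage id (e.g. a sequence number)
    final TreeMap<String, String> properties = new TreeMap<>();
    final TreeMap<String, NodeRecord> children = new TreeMap<>();

    NodeRecord(long id) {
        this.id = id;
    }

    // compute the hash over the properties and child hashes, and store it
    // as a normal property - the hash is *not* used to address the storage
    byte[] updateHash() throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (Map.Entry<String, String> p : properties.entrySet()) {
            if (p.getKey().equals(":hash")) {
                continue;                            // don't hash the hash itself
            }
            md.update(p.getKey().getBytes(StandardCharsets.UTF_8));
            md.update(p.getValue().getBytes(StandardCharsets.UTF_8));
        }
        for (Map.Entry<String, NodeRecord> c : children.entrySet()) {
            md.update(c.getKey().getBytes(StandardCharsets.UTF_8));
            md.update(c.getValue().updateHash());    // child content, not child id
        }
        byte[] hash = md.digest();
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        properties.put(":hash", hex.toString());
        return hash;
    }

    // diff two trees; subtrees with equal ":hash" values are skipped,
    // so unchanged parts of the repository are never visited
    static void diff(String path, NodeRecord a, NodeRecord b, StringBuilder out) {
        String ha = a == null ? null : a.properties.get(":hash");
        String hb = b == null ? null : b.properties.get(":hash");
        if (ha != null && ha.equals(hb)) {
            return;                                  // same content hash: identical subtree
        }
        if (a == null) {
            out.append("+ ").append(path).append('\n');
        } else if (b == null) {
            out.append("- ").append(path).append('\n');
        } else {
            out.append("^ ").append(path).append('\n');
        }
        TreeSet<String> names = new TreeSet<>();
        if (a != null) {
            names.addAll(a.children.keySet());
        }
        if (b != null) {
            names.addAll(b.children.keySet());
        }
        for (String name : names) {
            diff(path + "/" + name,
                    a == null ? null : a.children.get(name),
                    b == null ? null : b.children.get(name),
                    out);
        }
    }
}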

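Sketch 2: a virtual repository that distributes (shards) nodes by a stable key derived from the path. The content hash is of no use as a routing key, because it changes whenever the node content changes:

import java.util.List;

class VirtualRepository {

    private final List<String> repositoryUrls;       // one entry per shard

    VirtualRepository(List<String> repositoryUrls) {
        this.repositoryUrls = repositoryUrls;
    }

    // route a node to a shard based on its path (a stable key)
    String repositoryFor(String path) {
        int index = Math.floorMod(path.hashCode(), repositoryUrls.size());
        return repositoryUrls.get(index);
    }
}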