Hi,

The current jackrabbit-ngp sandbox contains two storage engines, "journal" and "store". They are broadly similar: both store immutable binary records, and in both the storage engine is in charge of assigning identifiers to stored records. I think this general storage model works very well with the NGP architecture as currently outlined.
The crucial difference between the two engines is in how they organize the records and what kind of identifiers they use. The "journal" engine stores records in an append-only journal file (with a few exceptions for very small and very large records) and uses the location of the record within the journal as the record identifier. The "store" engine stores records as individual files named and identified by the SHA-1 hash of the record content.

The location-based addressing used by the "journal" engine is more performance- and space-efficient: the identifiers are compact (64 bits in the current version) and accessing any record is at most a single disk read away. The engine also has good locality of reference, as content created or modified in the same revision is stored in adjacent disk locations.

The content-based addressing used by the "store" engine is easier to implement, maps well to a distributed/clustered model, and is very easy to cache. However, the identifiers are larger (160 bits in the current version) and accessing a record requires at least one index lookup on normal hardware. The engine will also automatically deduplicate identical content even if no copy or versioning operations were used to create the duplicates.

Both models have their benefits and drawbacks, and I haven't yet come to a conclusion about which (if either) would be the best storage model for NGP. In this post I wanted to outline my current thinking on the matter and open the discussion for any comments or insights I may be missing.

BR,

Jukka Zitting
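PS. To make the identifier comparison concrete, here is a rough Java sketch of the two addressing schemes. This is illustration only, not the actual sandbox code: the class and method names are made up, and the location-based identifier is modeled simply as the record's byte offset in the journal file (the real engine's 64-bit encoding may differ).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RecordIds {

    // Content-based id, as in the "store" engine: the SHA-1 hash of the
    // record content, rendered as 40 hex characters (160 bits).
    static String contentId(byte[] record) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest(record)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be present in every Java runtime
            throw new IllegalStateException(e);
        }
    }

    // Location-based id, as in the "journal" engine: modeled here as just
    // the byte offset of the record within the journal file, which fits
    // in a single 64-bit long.
    static long locationId(long journalOffset) {
        return journalOffset;
    }

    public static void main(String[] args) {
        byte[] record = "hello, world".getBytes(StandardCharsets.UTF_8);
        // Two records with identical content get the same "store" id,
        // which is what makes the engine deduplicate automatically.
        System.out.println("store id:   " + contentId(record));
        System.out.println("journal id: " + locationId(1024L));
    }
}
```

Note how the trade-off shows up directly in the identifiers: the journal id is an opaque 64-bit pointer that must stay stable (records can't move without rewriting references), while the store id is 2.5x larger but can be recomputed from the content alone, anywhere in a cluster.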
