Hi Jukka,

How about a mix of both, where each engine contributes what it does best?
The 'journal' engine is IMO best suited for a large number of unique, small records: a large number because it requires fewer individual files than the 'store'; unique because it cannot detect duplicate records; and small because trying to deduplicate small records is not worth the effort, which seems to pay off only for larger records. I would therefore use the 'journal' engine for structural data (the node hierarchy) and small properties.
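To make the trade-off concrete, here is a minimal sketch of the location-addressed journal idea: records are appended to a single file and identified by their byte offset, so a read is just a seek. The class and method names are my own illustration, not the actual jackrabbit-ngp API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch, not the jackrabbit-ngp journal engine itself.
class JournalSketch {
    private final RandomAccessFile file;

    JournalSketch(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    /** Appends a record and returns its offset as the 64-bit identifier. */
    synchronized long append(byte[] record) throws IOException {
        long id = file.length();
        file.seek(id);
        file.writeInt(record.length); // length prefix so reads know where to stop
        file.write(record);
        return id;
    }

    /** Any record is at most one seek + read away from its identifier. */
    synchronized byte[] read(long id) throws IOException {
        file.seek(id);
        byte[] record = new byte[file.readInt()];
        file.readFully(record);
        return record;
    }
}
```

Note how records appended in the same revision naturally end up at adjacent offsets, which is where the locality of reference comes from.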
The 'store' engine is able to detect duplicate records, which is only useful when the disk savings are significant enough to compensate for the cost of maintaining the hashes. Because the 'store' uses individual files, access to those files should not be excessive (an open/read/close per record, compared to a seek/read in a 'journal'). The 'store' therefore seems best suited for larger records that may occur multiple times in a repository.
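For comparison, a minimal sketch of the content-addressed store idea: each record lives in its own file named by its SHA-1 hash, so identical records collapse into a single file automatically. Again, the names here are illustrative assumptions, not the actual DataStore API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not the jackrabbit-ngp store engine itself.
class StoreSketch {
    private final Path directory;

    StoreSketch(Path directory) throws IOException {
        this.directory = Files.createDirectories(directory);
    }

    /** Stores a record; the 160-bit SHA-1 hash (as hex) is the identifier. */
    String put(byte[] record) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(record);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        String id = hex.toString();
        Path file = directory.resolve(id);
        if (!Files.exists(file)) { // duplicate records are written only once
            Files.write(file, record);
        }
        return id;
    }

    byte[] get(String id) throws IOException {
        return Files.readAllBytes(directory.resolve(id));
    }
}
```

Storing the same record twice yields the same identifier and writes only one file, which is exactly the deduplication whose hashing cost has to be paid for by the disk savings.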
Using the current Jackrabbit architecture, I would probably use the 'store' as a DataStore implementation and the 'journal' as an efficient persistence manager.
I think the locality you mentioned is also a very interesting topic that we might want to investigate further, e.g. R-trees, but that probably means a journaling approach is no longer possible.
Regards,
Marcel

Jukka Zitting wrote:
Hi,

The current jackrabbit-ngp sandbox contains two storage engines, "journal" and "store", that are basically similar in that both store immutable binary records and that the storage engine is in charge of assigning identifiers to stored records. I think that this general storage model works very well with the NGP architecture as currently outlined.

The crucial difference between the two engines is in how they organize the records and what kind of identifiers they use. The "journal" engine stores records in an append-only journal file (with a few exceptions for very small and very large records) and uses the location of the record within the journal as the record identifier. The "store" engine stores records as individual files named and identified using the SHA-1 hash of the record.

The location-based addressing used by the "journal" engine is more performance- and space-optimized, as the identifiers are compact (64 bits in the current version) and accessing any record is at most a single disk read away. The engine also has good locality of reference, as content created or modified in the same revision will be stored in adjacent disk locations.

The content-based addressing used by the "store" engine is easier to implement, maps well to a distributed/clustered model, and is very easy to cache. However, the identifiers are larger (160 bits in the current version) and accessing a record requires at least one index lookup on normal hardware. The engine will also automatically optimize the storage of duplicate content even if no copy or versioning operations were used to create the duplicates.

Both models have their benefits and drawbacks, and I haven't yet come to a conclusion which (if either) would be the best storage model for NGP. In this post I wanted to outline my current thinking on the matter and open up discussion for any comments or insights I'm probably missing.

BR,

Jukka Zitting
