Hi Jukka,

How about a mix of both, where each engine contributes what it does best?

The 'journal' engine is IMO best suited for a large number of small, unique records: a large number because it requires fewer individual files than the 'store'; unique because it cannot detect duplicate records; and small because deduplicating small records is not worth the effort, which only pays off for larger records. I would therefore use the 'journal' engine for structural data (the node hierarchy) and small properties.
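To make the trade-off concrete, here is a minimal sketch of location-based addressing as Jukka described it: records are appended to a single file and identified by their byte offset. The class and method names are purely illustrative, not the actual NGP API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class JournalSketch {
    private final RandomAccessFile journal;

    JournalSketch(String path) throws IOException {
        this.journal = new RandomAccessFile(path, "rw");
    }

    /** Appends a record and returns its 64-bit location identifier. */
    long append(byte[] record) throws IOException {
        long id = journal.length();
        journal.seek(id);
        journal.writeInt(record.length); // length prefix before the payload
        journal.write(record);
        return id;
    }

    /** Reads a record back with a single seek followed by a read. */
    byte[] read(long id) throws IOException {
        journal.seek(id);
        byte[] record = new byte[journal.readInt()];
        journal.readFully(record);
        return record;
    }
}
```

Note how records appended in the same revision end up at adjacent offsets, which is where the good locality of reference comes from, and how the identifier is just a long.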

The 'store' engine is able to detect duplicate records, but this is only useful when the disk space saved is significant and compensates for the cost of maintaining the hashes. Because the 'store' uses individual files, access to those files should not be excessive (open, read, close compared to a simple seek and read in a 'journal'). The 'store' therefore seems best suited for larger records that may occur multiple times in a repository.
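For comparison, a minimal sketch of content-based addressing: each record is written to a file named after its SHA-1 hash, so identical records automatically collapse into a single file. Again, the names are illustrative only, not the actual NGP API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class StoreSketch {
    private final Path directory;

    StoreSketch(Path directory) throws IOException {
        this.directory = Files.createDirectories(directory);
    }

    /** Stores a record and returns its 160-bit SHA-1 identifier as hex. */
    String put(byte[] record) throws IOException, NoSuchAlgorithmException {
        byte[] hash = MessageDigest.getInstance("SHA-1").digest(record);
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        Path file = directory.resolve(hex.toString());
        if (!Files.exists(file)) { // duplicate records are stored only once
            Files.write(file, record);
        }
        return hex.toString();
    }

    /** Reads a record back by its identifier. */
    byte[] get(String id) throws IOException {
        return Files.readAllBytes(directory.resolve(id));
    }
}
```

Storing the same record twice returns the same identifier and touches the disk only once, which is exactly the deduplication behaviour that has to pay for the hashing and per-file overhead.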

Using the current Jackrabbit architecture, I would probably use the 'store' as a DataStore implementation and the 'journal' as an efficient persistence manager.

I think the locality you mentioned is also a very interesting topic that we might want to investigate further, e.g. with r-trees, though that probably means a journaling approach is no longer possible.

regards
 marcel

Jukka Zitting wrote:
Hi,

The current jackrabbit-ngp sandbox contains two storage engines,
"journal" and "store", that are basically similar in that both store
immutable binary records and that the storage engine is in charge of
assigning identifiers to stored records. I think that this general
storage model works very well with the NGP architecture as currently
outlined.

The crucial difference between the two engines is in how they organize
the records and what kind of identifiers they use. The "journal"
engine stores records in an append-only journal file (with a few
exceptions for very small and very large records) and uses the
location of the record within the journal as the record identifier.
The "store" engine stores records as individual files named and
identified using the SHA-1 hash of the record.

The location-based addressing used by the "journal" engine is more
performance- and space-optimized as the identifiers are compact (64
bits in current version) and accessing any record is at most a single
disk read away. The engine also has good locality of reference as
content created or modified in the same revision will be stored on
adjacent disk locations.

The content-based addressing used by the "store" engine is easier to
implement, maps well to a distributed/clustered model, and is very
easy to cache. However, the identifiers are larger (160 bits in
current version) and accessing a record requires at least one index
lookup on normal hardware. The engine will also automatically optimize
storage of duplicate content even if no copy or versioning operations
were used to create the duplicates.

Both models have their benefits and drawbacks, and I haven't yet come
to a conclusion which (if either) would be the best storage model for
NGP. In this post I wanted to outline my current thinking on the
matter and open up discussion for any comments or insights I'm
probably missing.

BR,

Jukka Zitting
