Hi,

The current jackrabbit-ngp sandbox contains two storage engines,
"journal" and "store", that are basically similar in that both store
immutable binary records and that the storage engine is in charge of
assigning identifiers to stored records. I think that this general
storage model works very well with the NGP architecture as currently
outlined.

The crucial difference between the two engines is in how they organize
the records and what kind of identifiers they use. The "journal"
engine stores records in an append-only journal file (with a few
exceptions for very small and very large records) and uses the
location of the record within the journal as the record identifier.
The "store" engine stores records as individual files named and
identified using the SHA-1 hash of the record.

The location-based addressing used by the "journal" engine is more
performance- and space-optimized as the identifiers are compact (64
bits in current version) and accessing any record is at most a single
disk read away. The engine also has good locality of reference as
content created or modified in the same revision will be stored on
adjacent disk locations.
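To make the location-based model concrete, here is a rough sketch of the idea, using an in-memory buffer as a stand-in for the append-only journal file and a plain byte offset as the 64-bit identifier (class and method names here are hypothetical, not the actual NGP code):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Sketch of location-based addressing. Records are appended with a
// 4-byte length prefix, and the 64-bit identifier is simply the byte
// offset of the record in the journal. An in-memory buffer stands in
// for the journal file, so reading is one "seek" into the array.
public class JournalSketch {

    private final ByteArrayOutputStream journal = new ByteArrayOutputStream();

    /** Appends a length-prefixed record, returns its offset as the id. */
    public long addRecord(byte[] data) {
        long offset = journal.size();
        journal.write(ByteBuffer.allocate(4).putInt(data.length).array(), 0, 4);
        journal.write(data, 0, data.length);
        return offset;
    }

    /** Reads the record at the given offset: at most a single read. */
    public byte[] getRecord(long offset) {
        byte[] all = journal.toByteArray();
        ByteBuffer buffer =
            ByteBuffer.wrap(all, (int) offset, all.length - (int) offset);
        byte[] record = new byte[buffer.getInt()];
        buffer.get(record);
        return record;
    }
}
```

Note how records added in sequence end up at adjacent offsets, which is where the locality of reference comes from.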

The content-based addressing used by the "store" engine is easier to
implement, maps well to a distributed/clustered model, and is very
easy to cache. However, the identifiers are larger (160 bits in
current version) and accessing a record requires at least one index
lookup on normal hardware. The engine will also automatically optimize
storage of duplicate content even if no copy or versioning operations
were used to create the duplicates.
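The content-based model can be sketched in a similar way, with an in-memory map standing in for the per-record files the engine actually writes (again, the names are hypothetical):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Sketch of content-based addressing. The identifier is the 160-bit
// SHA-1 hash of the record, and an in-memory map stands in for the
// store's per-record files. Reading requires an index lookup (here, a
// map lookup) rather than a direct seek to a known location.
public class ContentStoreSketch {

    private final Map<String, byte[]> records = new HashMap<>();

    /** Stores the record under its SHA-1 hash, returns the hash. */
    public byte[] addRecord(byte[] data) {
        byte[] hash = sha1(data);
        // Identical content hashes to the same key, so duplicates
        // are stored only once, with no explicit copy tracking.
        records.put(toHex(hash), data.clone());
        return hash;
    }

    /** Looks up the record by its hash identifier. */
    public byte[] getRecord(byte[] identifier) {
        return records.get(toHex(identifier));
    }

    private static byte[] sha1(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be present in every Java platform.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Storing the same bytes twice yields the same identifier and a single stored copy, which is the automatic deduplication mentioned above.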

Both models have their benefits and drawbacks, and I haven't yet come
to a conclusion about which (if either) would be the best storage model
for NGP. In this post I wanted to outline my current thinking on the
matter and open the discussion for comments and any insights I may be
missing.

BR,

Jukka Zitting
