[
https://issues.apache.org/jira/browse/OAK-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davide Giannella updated OAK-4833:
----------------------------------
Fix Version/s: 1.6
> Document storage format changes
> -------------------------------
>
> Key: OAK-4833
> URL: https://issues.apache.org/jira/browse/OAK-4833
> Project: Jackrabbit Oak
> Issue Type: Technical task
> Components: doc, segment-tar
> Reporter: Michael Dürig
> Assignee: Michael Dürig
> Labels: documentation
> Fix For: 1.6, 1.5.17
>
>
> This issue serves as a collection of all changes to the storage format
> introduced with Oak Segment Tar and their impact. Once sufficiently
> stabilised this information should serve as basis for the documentation in
> {{oak-doc}}.
> || Change || Rationale || Impact || Migration || Since || Issues ||
> |Generation in segment header |Required to unequivocally determine the
> generation of a segment during cleanup. Segment retention time is given in
> number of generations (2 by default). |No performance, space impact expected
> |offline |0.0.2 |OAK-3348 |
> |Stable id for node states |Required to efficiently determine equality of
> node states. This can be seen as an intermediate step to decoupling the
> address of records from their identity. The next step is to introduce logical
> record ids (OAK-4659). |Node states increase by the size of one record id (3
> bytes / 20 bytes after OAK-4631). On top of that there is an additional block
> record of 18 bytes per node state. |offline |0.0.2 |OAK-3348
> |Binary index in tar files |Avoid traversing the repository to collect the gc
> roots for DSGC. Fetch them from an index instead. |Additional index entry per
> tar file. Adds a couple of bytes per external binary to each tar file. Exact
> size to be determined. [~frm] could you help with this? OAK-4740 is a
> regression wrt. resiliency caused by this change (and the fact that the
> blob store might return blob ids longer than 2k chars). |offline |0.0.4
> |OAK-4101
> |Simplified record ids |Preparation and precondition for logical record ids
> (OAK-4659). At the same time the simplest possible fix for OAK-2896. The
> latter leads to degeneration of segment sizes, which in turn has adverse
> effects on overall performance, resource utilisation and memory requirements.
> Without this fix OAK-2498 would need to be fixed in a different way that
> would require other changes in the storage format. I started to regard this
> issue as removing a premature optimisation (which caused OAK-2498). OTOH with
> OAK-4844 we should also start looking into mitigations and what those would
> mean to size vs. simplicity vs. performance. |Record ids grow from 3 bytes
> to 18 bytes when serialised into records. Impact on repositories to be
> assessed but can be anywhere from almost none to x6. OAK-4812 is a
> performance regression caused by this change. Its overall impact is yet to be
> assessed. |offline |0.0.10 |OAK-4631, OAK-4844
> |Storage format versioning |In order to be able to further evolve the storage
> format with minimal impact on existing deployments we need to carefully
> version the various storage entities (segments, tar files, etc.). |No
> performance, space impact expected |offline |0.0.2/ 0.0.10 |OAK-4232,
> OAK-4683, OAK-4295
> |Logical record ids |We need to separate addresses of records from their
> identity to be able to further scale the TarMK. OAK-3348 (the online
> compaction misery) can be seen as a symptom of failing to understand this
> earlier. The stable ids introduced with OAK-3348 are a first step in this
> direction. However, this is not sufficient to implement features such as
> background compaction (OAK-4756), partial compaction (OAK-3349) or
> incremental compaction (OAK-3350). |A small size overhead per segment for
> the logical id table. Further impact to be evaluated ([~frm], please add your
> assessment here). |offline |0.0.14 (planned) |OAK-4659
> |External index for segments |Avoid recreating tar files if indexes are
> corrupt/missing. Just recreate the indexes. |Faster startup after a crash.
> Overall less disk space usage as no unnecessary backup files are created.
> |online |not yet planned |OAK-4649
> |In-place journal |Reduce complexity by in-lining the journal log. Fewer
> files, fewer chances to break something. Also the granularity of the log would
> increase as flushing of the persisted head would not be required any more.
> Resilience would improve as the roll-back functionality could operate at a
> finer granularity. |No more journal.log. Better resiliency. Significant risk
> for regression of OAK-4291 if not implemented properly. Most likely a
> significant refactoring of some parts of the code is required before we can
> proceed with this issue. |online |not yet planned |OAK-4103
> |Root record types |With the information currently available from the segment
> headers we cannot collect statistics about segment usage on repositories of
> non-trivial sizes. This fix would allow us to build more scalable tools in
> that respect. |None expected wrt. performance and size under normal
> operation. |offline |0.0.14 (planned) (waiting for OAK-4659 as implementation
> depends on how we progress there) |OAK-2498
> Misc ideas currently on the back burner:
> * SegmentMK: Arch segments (OAK-1905)
> * Extension headers for segments (no issue yet)
> * More memory efficient serialisation of values (e.g. boolean) (no issue yet)
> * Protocol Buffer for serialising records (no issue yet)
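
The "almost none to x6" range quoted for the simplified record ids can be illustrated with a rough back-of-the-envelope sketch. It only uses the two sizes from the table (3 bytes before, 18 bytes after); the class name, the method names and the simple "ids plus payload" model of a record are assumptions made for illustration and are not Oak's actual record layout or API.

```java
/**
 * Rough size model, not Oak code: a record is assumed to consist of
 * `refs` serialised record ids plus `payload` bytes of other data.
 */
public final class RecordIdGrowth {
    // Serialised record id sizes in bytes, as quoted in the table above.
    public static final int OLD_RECORD_ID_SIZE = 3;
    public static final int NEW_RECORD_ID_SIZE = 18;

    /** Size in bytes of a hypothetical record under the model above. */
    public static long recordSize(int idSize, int refs, int payload) {
        return (long) idSize * refs + payload;
    }

    /** Ratio of the new record size to the old record size. */
    public static double growthFactor(int refs, int payload) {
        if (refs == 0 && payload == 0) {
            throw new IllegalArgumentException("empty record has no size to compare");
        }
        return (double) recordSize(NEW_RECORD_ID_SIZE, refs, payload)
                / recordSize(OLD_RECORD_ID_SIZE, refs, payload);
    }

    public static void main(String[] args) {
        // A record made up only of record ids grows by the full factor.
        System.out.println(growthFactor(10, 0)); // prints 6.0
        // A record dominated by payload barely grows at all.
        System.out.println(growthFactor(1, 1000));
    }
}
```

Under this model the growth of a repository depends on how much of its content consists of references between records as opposed to payload, which is why the impact has to be assessed per repository.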