[ https://issues.apache.org/jira/browse/OAK-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744873#comment-15744873 ]
Michael Dürig commented on OAK-4833:
------------------------------------

This issue is/was mainly about collecting, in a central place, all format changes we've made between {{oak-segment}} and {{oak-segment-tar}}. This should now become part of our (developer) documentation about the tar and segment format (e.g. OAK-4648). Depending on whether we drop the documentation for the now deprecated {{oak-segment}} module right away or later, I think this should either be part of the documentation itself or just linked from it.

> Document storage format changes
> -------------------------------
>
>                 Key: OAK-4833
>                 URL: https://issues.apache.org/jira/browse/OAK-4833
>             Project: Jackrabbit Oak
>          Issue Type: Technical task
>          Components: doc, segment-tar
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>              Labels: documentation
>             Fix For: 1.6, 1.5.17
>
>
> This issue serves as a collection of all changes to the storage format introduced with Oak Segment Tar and their impact. Once sufficiently stabilised, this information should serve as the basis for the documentation in {{oak-doc}}.
>
> || Change || Rationale || Impact || Migration || Since || Issues ||
> |Generation in segment header |Required to unequivocally determine the generation of a segment during cleanup. Segment retention time is given in number of generations (2 by default). |No performance or space impact expected |offline |0.0.2 |OAK-3348 |
> |Stable id for node states |Required to efficiently determine equality of node states. This can be seen as an intermediate step towards decoupling the address of records from their identity. The next step is to introduce logical record ids (OAK-4659). |Node states increase by the size of one record id (3 bytes / 20 bytes after OAK-4631). On top of that there is an additional block record of 18 bytes per node state. |offline |0.0.2 |OAK-3348 |
> |Binary index in tar files |Avoid traversing the repository to collect the gc roots for DSGC. Fetch them from an index instead. |Additional index entry per tar file. Adds a couple of bytes per external binary to each tar file. Exact size to be determined. [~frm] could you help with this? OAK-4740 is a regression wrt. resiliency caused by this change (and the fact that the blob store might return blob ids longer than 2k chars). |offline |0.0.4 |OAK-4101 |
> |Simplified record ids |Preparation and precondition for logical record ids (OAK-4659). At the same time the simplest possible fix for OAK-2896. The latter leads to degeneration of segment sizes, which in turn has adverse effects on overall performance, resource utilisation and memory requirements. Without this fix OAK-2498 would need to be fixed in a different way that would require other changes in the storage format. I started to regard this issue as removing a premature optimisation (which caused OAK-2498). OTOH with OAK-4844 we should also start looking into mitigations and what those would mean for size vs. simplicity vs. performance. |Record ids grow from 3 bytes to 18 bytes when serialised into records. Impact on repositories to be assessed but can be anywhere from almost none to x6. OAK-4812 is a performance regression caused by this change. Its overall impact is yet to be assessed. |offline |0.0.10 |OAK-4631, OAK-4844 |
> |Storage format versioning |In order to be able to further evolve the storage format with minimal impact on existing deployments we need to carefully version the various storage entities (segments, tar files, etc.). |No performance or space impact expected |offline |0.0.2 / 0.0.10 |OAK-4232, OAK-4683, OAK-4295 |
> |Logical record ids |We need to separate addresses of records from their identity to be able to further scale the TarMK. OAK-3348 (the online compaction misery) can be seen as a symptom of failing to understand this earlier. The stable ids introduced with OAK-3348 are a first step in this direction. However, this is not sufficient to implement features such as background compaction (OAK-4756), partial compaction (OAK-3349) or incremental compaction (OAK-3350). |A small size overhead per segment for the logical id table. Further impact to be evaluated ([~frm], please add your assessment here). |offline |0.0.14 (planned) |OAK-4659 |
> |External index for segments |Avoid recreating tar files if indexes are corrupt/missing. Just recreate the indexes. |Faster startup after a crash. Overall less disk space usage as no unnecessary backup files are created. |online |not yet planned |OAK-4649 |
> |In-place journal |Reduce complexity by in-lining the journal log. Fewer files, fewer chances to break something. Also the granularity of the log would increase as flushing of the persisted head would not be required any more. Resilience would improve as the roll-back functionality could operate at a finer granularity. |No more journal.log. Better resiliency. Significant risk of regression of OAK-4291 if not implemented properly. Most likely a significant refactoring of some parts of the code is required before we can proceed with this issue. |online |not yet planned |OAK-4103 |
> |Root record types |With the information currently available from the segment headers we cannot collect statistics about segment usage on repositories of non-trivial sizes. This fix would allow us to build more scalable tools in that respect. |None expected wrt. performance and size under normal operation. |offline |0.0.14 (planned) (waiting for OAK-4659 as implementation depends on how we progress there) |OAK-2498 |
>
> Misc ideas currently on the back burner:
> * SegmentMK: Arch segments (OAK-1905)
> * Extension headers for segments (no issue yet)
> * More memory efficient serialisation of values (e.g. boolean) (no issue yet)
> * Protocol Buffer for serialising records (no issue yet)
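
To illustrate the "Generation in segment header" row: once every segment carries its generation, a cleanup decision can be reduced to an integer comparison. The following is only a minimal sketch, not Oak's actual cleanup code. The class name {{GenerationRetention}}, the method {{isReclaimable}} and the reading of "retained generations" as "the newest N generations, including the current one" are all assumptions made for this example.

```java
// Hypothetical helper for illustration only; it is not part of Oak.
final class GenerationRetention {

    // Default retention stated in the table: 2 generations.
    static final int DEFAULT_RETAINED_GENERATIONS = 2;

    private GenerationRetention() {
    }

    /**
     * Decides whether a segment may be reclaimed during cleanup.
     * Assumption: the newest `retainedGenerations` generations
     * (including the current one) are kept, everything older may go.
     */
    static boolean isReclaimable(int segmentGeneration,
                                 int currentGeneration,
                                 int retainedGenerations) {
        if (retainedGenerations < 1) {
            throw new IllegalArgumentException("retainedGenerations must be >= 1");
        }
        // The generation is read from the segment header, so no traversal
        // of the segment's content is needed to make this decision.
        return segmentGeneration <= currentGeneration - retainedGenerations;
    }
}
```

Under these assumptions, with the default of 2 retained generations and a current generation of 5, segments of generations 5 and 4 are kept, while segments of generation 3 or older can be reclaimed.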