[ 
https://issues.apache.org/jira/browse/OAK-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744873#comment-15744873
 ] 

Michael Dürig commented on OAK-4833:
------------------------------------

This issue is/was mainly about collecting all format changes we've made between 
{{oak-segment}} and {{oak-segment-tar}} in a central place. 

This should now somehow become part of our (developer) documentation about the 
tar and segment format (e.g. OAK-4648). Depending on whether we drop the 
documentation for the now deprecated {{oak-segment}} module right away or 
later, I think this should either be part of the documentation itself or just 
linked from it.

> Document storage format changes
> -------------------------------
>
>                 Key: OAK-4833
>                 URL: https://issues.apache.org/jira/browse/OAK-4833
>             Project: Jackrabbit Oak
>          Issue Type: Technical task
>          Components: doc, segment-tar
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>              Labels: documentation
>             Fix For: 1.6, 1.5.17
>
>
> This issue serves as a collection of all changes to the storage format 
> introduced with Oak Segment Tar and their impact. Once sufficiently 
> stabilised, this information should serve as the basis for the documentation 
> in {{oak-doc}}. 
> || Change || Rationale || Impact || Migration || Since || Issues ||
> |Generation in segment header |Required to unequivocally determine the 
> generation of a segment during cleanup. Segment retention time is given as a 
> number of generations (2 by default). |No performance or space impact expected 
> |offline |0.0.2 |OAK-3348 | 
> |Stable id for node states |Required to efficiently determine equality of 
> node states. This can be seen as an intermediate step to decoupling the 
> address of records from their identity. The next step is to introduce logical 
> record ids (OAK-4659). |Node states increase by the size of one record id (3 
> bytes / 20 bytes after OAK-4631). On top of that there is an additional block 
> record à 18 bytes per node state. |offline |0.0.2 |OAK-3348
> |Binary index in tar files |Avoid traversing the repository to collect the gc 
> roots for DSGC. Fetch them from an index instead. |Additional index entry per 
> tar file. Adds a couple of bytes per external binary to each tar file. Exact 
> size to be determined. [~frm] could you help with this? OAK-4740 is a 
> regression wrt. resiliency caused by this change (and the fact that the 
> blob store might return blob ids longer than 2k chars). |offline |0.0.4 
> |OAK-4101
> |Simplified record ids |Preparation and precondition for logical record ids 
> (OAK-4659). At the same time the simplest possible fix for OAK-2896. The 
> latter leads to degeneration of segment sizes, which in turn has adverse 
> effects on overall performance, resource utilisation and memory requirements. 
> Without this fix OAK-2498 would need to be fixed in a different way that 
> would require other changes in the storage format. I started to regard this 
> issue as removing a premature optimisation (which caused OAK-2498). OTOH with 
> OAK-4844 we should also start looking into mitigations and what those would 
> mean for size vs. simplicity vs. performance. |Record ids grow from 3 bytes 
> to 18 bytes when serialised into records. Impact on repositories to be 
> assessed but can be anywhere from almost none to x6. OAK-4812 is a 
> performance regression caused by this change. Its overall impact is yet to be 
> assessed. |offline |0.0.10 |OAK-4631, OAK-4844
> |Storage format versioning |In order to be able to further evolve the storage 
> format with minimal impact on existing deployments we need to carefully 
> version the various storage entities (segments, tar files, etc.). |No 
> performance or space impact expected |offline |0.0.2 / 0.0.10 |OAK-4232, 
> OAK-4683, OAK-4295
> |Logical record ids |We need to separate addresses of records from their 
> identity to be able to further scale the TarMK. OAK-3348 (the online 
> compaction misery) can be seen as a symptom of failing to understand this 
> earlier. The stable ids introduced with OAK-3348 are a first step in this 
> direction. However, this is not sufficient to implement features such as 
> background compaction (OAK-4756), partial compaction (OAK-3349) or 
> incremental compaction (OAK-3350). |A small size overhead per segment for 
> the logical id table. Further impact to be evaluated ([~frm], please add your 
> assessment here). |offline |0.0.14 (planned) |OAK-4659
> |External index for segments |Avoid recreating tar files if indexes are 
> corrupt/missing. Just recreate the indexes. |Faster startup after a crash. 
> Overall less disk space usage as no unnecessary backup files are created. 
> |online |not yet planned |OAK-4649
> |In-place journal |Reduce complexity by inlining the journal log. Fewer 
> files, fewer chances to break something. Also the granularity of the log would 
> increase as flushing of the persisted head would not be required any more. 
> Resilience would improve as the roll-back functionality could operate at a 
> finer granularity. |No more journal.log. Better resiliency. Significant risk 
> for regression of OAK-4291 if not implemented properly. Most likely a 
> significant refactoring of some parts of the code is required before we can 
> proceed with this issue. |online |not yet planned |OAK-4103
> |Root record types |With the information currently available from the segment 
> headers we cannot collect statistics about segment usage on repositories of 
> non-trivial size. This fix would allow us to build more scalable tools in 
> that respect. |None expected wrt. performance and size under normal 
> operation. |offline |0.0.14 (planned) (waiting for OAK-4659 as implementation 
> depends on how we progress there) |OAK-2498
> Misc ideas currently on the back burner:
> * SegmentMK: Arch segments (OAK-1905)
> * Extension headers for segments (no issue yet)
> * More memory efficient serialisation of values (e.g. boolean) (no issue yet)
> * Protocol Buffer for serialising records (no issue yet)
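
The "almost none to x6" range quoted for the simplified record ids follows from the 3-byte and 18-byte sizes given in the table. The sketch below illustrates that arithmetic only. It assumes a deliberately simplified record made of some non-reference payload plus a number of serialised record ids; the function name, parameters and record model are hypothetical and are not taken from the Oak code base.

```python
# Sizes taken from the "Simplified record ids" row of the table.
OLD_RECORD_ID_BYTES = 3
NEW_RECORD_ID_BYTES = 18


def growth_factor(payload_bytes: int, record_id_count: int) -> float:
    """Return new_size / old_size for one hypothetical record.

    Assumption: a record is modelled as `payload_bytes` of non-reference
    data plus `record_id_count` serialised record ids. Real records have
    more structure than this.
    """
    old_size = payload_bytes + record_id_count * OLD_RECORD_ID_BYTES
    new_size = payload_bytes + record_id_count * NEW_RECORD_ID_BYTES
    if old_size == 0:
        raise ValueError("cannot compute growth of an empty record")
    return new_size / old_size


if __name__ == "__main__":
    # A record without references does not grow at all.
    print(growth_factor(payload_bytes=4096, record_id_count=0))
    # A record consisting only of references grows by the full factor.
    print(growth_factor(payload_bytes=0, record_id_count=10))
    # A mixed record lands somewhere in between.
    print(growth_factor(payload_bytes=60, record_id_count=4))
```

Under this model the growth of a repository depends on how reference-heavy its records are, which is presumably why the actual impact still has to be measured on real repositories.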



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
