[ https://issues.apache.org/jira/browse/OAK-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Dürig updated OAK-4833:
-------------------------------
    Description: 
This issue serves as a collection of all changes to the storage format
introduced with Oak Segment Tar and their impact. Once sufficiently stabilised,
this information should serve as the basis for the documentation in {{oak-doc}}.

|| Change || Rationale || Impact || Migration || Since || Issues ||
|Generation in segment header |Required to unequivocally determine the 
generation of a segment during cleanup. Segment retention time is given in 
number of generations (2 by default). |No performance or space impact expected
|offline |0.0.2 |OAK-3348 | 
|Stable id for node states |Required to efficiently determine equality of node 
states. This can be seen as an intermediate step to decoupling the address of 
records from their identity. The next step is to introduce logical record ids 
(OAK-4659). |Node states increase by the size of one record id (3 bytes / 20 
bytes after OAK-4631). On top of that there is an additional block record à 18 
bytes per node state. |offline |0.0.2 |OAK-3348
|Binary index in tar files |Avoid traversing the repository to collect the gc 
roots for DSGC. Fetch them from an index instead. |Additional index entry per 
tar file. Adds a couple of bytes per external binary to each tar file. Exact 
size to be determined. [~frm] could you help with this? OAK-4740 is a 
regression wrt. resiliency caused by this change (and the fact that the blob
store might return blob ids longer than 2k chars).  |offline |0.0.4 |OAK-4101
|Simplified record ids |Preparation and precondition for logical record ids 
(OAK-4659). At the same time the simplest possible fix for OAK-2896. The latter 
leads to degeneration of segment sizes, which in turn has adverse effects on 
overall performance, resource utilisation and memory requirements. Without this 
fix OAK-2498 would need to be fixed in a different way that would require other 
changes in the storage format. I started to regard this issue as removing a 
premature optimisation (which caused OAK-2498). OTOH with OAK-4844 we should 
also start looking into mitigations and what those would mean to size vs. 
simplicity vs. performance.  |Record ids grow from 3 bytes to 18 bytes when 
serialised into records. Impact on repositories is yet to be assessed but can
be anywhere between almost none and x6. OAK-4812 is a performance regression
caused by this change. Its overall impact is yet to be assessed. |offline |0.0.10
|OAK-4631, OAK-4844
|Storage format versioning |In order to be able to further evolve the storage
format with minimal impact on existing deployments we need to carefully
version the various storage entities (segments, tar files, etc.). |No
performance or space impact expected |offline |0.0.2 / 0.0.10 |OAK-4232, OAK-4683,
OAK-4295
|Logical record ids |We need to separate addresses of records from their 
identity to be able to further scale the TarMK. OAK-3348 (the online compaction 
misery) can be seen as a symptom of failing to understand this earlier. The 
stable ids introduced with OAK-3348 are a first step in this direction.
However, this is not sufficient to implement features such as background
compaction (OAK-4756), partial compaction (OAK-3349) or incremental compaction 
(OAK-3350).  |A small size overhead per segment for the logical id table. 
Further impact to be evaluated ([~frm], please add your assessment here). 
|offline |0.0.14 (planned) |OAK-4659
|External index for segments |Avoid recreating tar files if indexes are 
corrupt/missing. Just recreate the indexes. |Faster startup after a crash. 
Overall less disk space usage as no unnecessary backup files are created. 
|online |not yet planned |OAK-4649
|In-place journal |Reduce complexity by in-lining the journal log. Fewer files,
fewer chances to break something. Also the granularity of the log would increase
as flushing of the persisted head would not be required any more. Resilience 
would improve as the roll-back functionality could operate at a finer 
granularity. |No more journal.log. Better resiliency. Significant risk for 
regression of OAK-4291 if not implemented properly. Most likely a significant 
refactoring of some parts of the code is required before we can proceed with 
this issue.  |online |not yet planned |OAK-4103
|Root record types |With the information currently available from the segment
headers we cannot collect statistics about segment usage on repositories of
non-trivial sizes. This fix would allow us to build more scalable tools in that
respect.  |None expected wrt. performance and size under normal operation.
|offline |0.0.14 (planned) (waiting for OAK-4659 as implementation depends on 
how we progress there) |OAK-2498
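
The "Generation in segment header" row above says that retention is counted in generations (2 by default). A minimal sketch of what such a cleanup check could look like follows. The class and method names and the exact comparison are assumptions made for illustration only; they are not taken from the Oak code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GenerationCleanupSketch {
    // Default retention taken from the table above: 2 generations.
    static final int DEFAULT_RETAINED_GENERATIONS = 2;

    // Assumption: a segment is reclaimable once it is at least
    // 'retainedGenerations' older than the current generation.
    static boolean isReclaimable(int segmentGeneration, int currentGeneration, int retainedGenerations) {
        return segmentGeneration <= currentGeneration - retainedGenerations;
    }

    // Returns the generations of the segments that would survive cleanup.
    static List<Integer> retained(List<Integer> segmentGenerations, int currentGeneration) {
        List<Integer> kept = new ArrayList<>();
        for (int generation : segmentGenerations) {
            if (!isReclaimable(generation, currentGeneration, DEFAULT_RETAINED_GENERATIONS)) {
                kept.add(generation);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // With the current generation at 5, only generations 4 and 5 are kept.
        System.out.println(retained(Arrays.asList(1, 2, 3, 4, 5), 5)); // prints [4, 5]
    }
}
```

The point of the sketch is that the decision needs nothing but the generation read from the segment header, which is what makes it unequivocal.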

Misc ideas currently on the back burner:
* SegmentMK: Arch segments (OAK-1905)
* Extension headers for segments (no issue yet)
* More memory efficient serialisation of values (e.g. boolean) (no issue yet)
* Protocol Buffers for serialising records (no issue yet)
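
To make the "3 bytes to 18 bytes" figure from the "Simplified record ids" row in the table above more concrete, here is a rough sketch of the two serialised forms. Only the two sizes come from this issue. The byte layouts (a 1-byte segment reference index plus a 2-byte offset versus a full 16-byte segment UUID plus a 2-byte offset) and all names are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class RecordIdSizeSketch {
    // Assumed compact form: 1-byte index into a per-segment reference
    // table followed by a 2-byte offset (3 bytes in total).
    static byte[] compact(int segmentRefIndex, int offset) {
        return ByteBuffer.allocate(3)
                .put((byte) segmentRefIndex)
                .putShort((short) offset)
                .array();
    }

    // Assumed simplified form: the full 16-byte segment UUID followed by
    // a 2-byte offset (18 bytes in total).
    static byte[] simplified(UUID segmentId, int offset) {
        return ByteBuffer.allocate(18)
                .putLong(segmentId.getMostSignificantBits())
                .putLong(segmentId.getLeastSignificantBits())
                .putShort((short) offset)
                .array();
    }

    public static void main(String[] args) {
        byte[] before = compact(7, 1024);
        byte[] after = simplified(UUID.randomUUID(), 1024);
        // prints "3 -> 18 bytes, factor x6"
        System.out.println(before.length + " -> " + after.length
                + " bytes, factor x" + after.length / before.length);
    }
}
```

The x6 factor is the worst case for a record that consists of nothing but record ids. Records dominated by other payload would grow far less, which matches the "almost none to x6" range given in the table.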



  was:
This issue serves as a collection of all changes to the storage format
introduced with Oak Segment Tar and their impact. Once sufficiently stabilised,
this information should serve as the basis for the documentation in {{oak-doc}}.

|| Change || Rationale || Impact || Migration || Since || Issues ||
|Generation in segment header |Required to unequivocally determine the 
generation of a segment during cleanup. Segment retention time is given in 
number of generations (2 by default). |No performance or space impact expected
|offline |0.0.2 |OAK-3348 | 
|Stable id for node states |Required to efficiently determine equality of node 
states. This can be seen as an intermediate step to decoupling the address of 
records from their identity. The next step is to introduce logical record ids 
(OAK-4659). |Node states increase by the size of one record id (3 bytes / 20 
bytes after OAK-4631). On top of that there is an additional block record à 18 
bytes per node state. |offline |0.0.2 |OAK-3348
|Binary index in tar files |Avoid traversing the repository to collect the gc 
roots for DSGC. Fetch them from an index instead. |Additional index entry per 
tar file. Adds a couple of bytes per external binary to each tar file. Exact 
size to be determined. [~frm] could you help with this? OAK-4740 is a 
regression wrt. resiliency caused by this change (and the fact that the blob
store might return blob ids longer than 2k chars).  |offline |0.0.4 |OAK-4101
|Simplified record ids |Preparation and precondition for logical record ids 
(OAK-4659). At the same time the simplest possible fix for OAK-2896. The latter 
leads to degeneration of segment sizes, which in turn has adverse effects on 
overall performance, resource utilisation and memory requirements. Without this 
fix OAK-2498 would need to be fixed in a different way that would require other 
changes in the storage format. I started to regard this issue as removing a 
premature optimisation (which caused OAK-2498).  |Record ids grow from 3 bytes
to 18 bytes when serialised into records. Impact on repositories is yet to be
assessed but can be anywhere between almost none and x6. OAK-4812 is a
performance regression caused by this change. Its overall impact is yet to be assessed.
|offline |0.0.10 |OAK-4631
|Storage format versioning |In order to be able to further evolve the storage
format with minimal impact on existing deployments we need to carefully
version the various storage entities (segments, tar files, etc.). |No
performance or space impact expected |offline |0.0.2 / 0.0.10 |OAK-4232, OAK-4683,
OAK-4295
|Logical record ids |We need to separate addresses of records from their 
identity to be able to further scale the TarMK. OAK-3348 (the online compaction 
misery) can be seen as a symptom of failing to understand this earlier. The 
stable ids introduced with OAK-3348 are a first step in this direction.
However, this is not sufficient to implement features such as background
compaction (OAK-4756), partial compaction (OAK-3349) or incremental compaction 
(OAK-3350).  |A small size overhead per segment for the logical id table. 
Further impact to be evaluated ([~frm], please add your assessment here). 
|offline |0.0.14 (planned) |OAK-4659
|External index for segments |Avoid recreating tar files if indexes are 
corrupt/missing. Just recreate the indexes. |Faster startup after a crash. 
Overall less disk space usage as no unnecessary backup files are created. 
|online |not yet planned |OAK-4649
|In-place journal |Reduce complexity by in-lining the journal log. Fewer files,
fewer chances to break something. Also the granularity of the log would increase
as flushing of the persisted head would not be required any more. Resilience 
would improve as the roll-back functionality could operate at a finer 
granularity. |No more journal.log. Better resiliency. Significant risk for 
regression of OAK-4291 if not implemented properly. Most likely a significant 
refactoring of some parts of the code is required before we can proceed with 
this issue.  |online |not yet planned |OAK-4103
|Root record types |With the information currently available from the segment
headers we cannot collect statistics about segment usage on repositories of
non-trivial sizes. This fix would allow us to build more scalable tools in that
respect.  |None expected wrt. performance and size under normal operation.
|offline |0.0.14 (planned) (waiting for OAK-4659 as implementation depends on 
how we progress there) |OAK-2498

Misc ideas currently on the back burner:
* SegmentMK: Arch segments (OAK-1905)
* Extension headers for segments (no issue yet)
* More memory efficient serialisation of values (e.g. boolean) (no issue yet)
* Protocol Buffers for serialising records (no issue yet)




> Document storage format changes
> -------------------------------
>
>                 Key: OAK-4833
>                 URL: https://issues.apache.org/jira/browse/OAK-4833
>             Project: Jackrabbit Oak
>          Issue Type: Technical task
>          Components: doc, segment-tar
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>              Labels: documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
