Hi,

Yesterday I took some time for a little experiment: how many optimisations can be removed from the current segment format while maintaining the same functionality?
I pushed the result to a branch on GitHub [1]. The code on that branch is similar to the current trunk, except for the following changes:

1. Record IDs are always serialised in their entirety. As such, a serialised record ID occupies 18 bytes instead of 3.

2. Because of the previous change, the table of referenced segment IDs is no longer needed, so I removed it from the segment header. It turns out that this table is in fact needed for the mark phase of compaction, so that feature is broken in the branch.

Since the code is nevertheless in a runnable state, I generated some content using both the current trunk and the dumber version of oak-segment-tar.

This is the repository created by the dumb oak-segment-tar:

  524744 data00000a.tar
  524584 data00001a.tar
  524688 data00002a.tar
  460896 data00003a.tar
       8 journal.log
       0 repo.lock

This is the one created by the current trunk:

  524864 data00000a.tar
  524656 data00001a.tar
  524792 data00002a.tar
  297288 data00003a.tar
       8 journal.log
       0 repo.lock

The process that generates the content is identical between the two executions, and the generated content comes from a real-world scenario: for those familiar with it, it is produced by an installation of Adobe Experience Manager.

It looks like the size of the repository doesn't change much. The per-record de-optimisation in the small is probably dwarfed by the binary content in the large. Another effect of my change is that there is no longer a limit on the number of referenced segment IDs per segment, which might allow segments to pack more records than before.

Open questions aside, the clear advantage of this change is a great simplification of the code. I could probably peel off a few more lines, but what I removed is already a considerable amount. Look at the code!

Francesco

[1]: https://github.com/francescomari/jackrabbit-oak/tree/dumb
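P.S. To make the 18-vs-3-byte comparison concrete, here is an illustrative sketch of the two encodings in Python. This is not the actual oak-segment-tar code; the exact layout (16-byte segment UUID plus 2-byte record offset for the full form, 1-byte index into the segment's reference table plus 2-byte offset for the short form) is an assumption derived from the sizes mentioned above.

```python
import struct
import uuid

def encode_full(segment_id: uuid.UUID, offset: int) -> bytes:
    """Serialise a record ID in its entirety: the 16-byte segment UUID
    (as two big-endian 64-bit halves) followed by a 2-byte record offset.
    Hypothetical layout, 18 bytes total."""
    msb = segment_id.int >> 64
    lsb = segment_id.int & ((1 << 64) - 1)
    return struct.pack(">QQH", msb, lsb, offset)

def encode_short(segment_index: int, offset: int) -> bytes:
    """Serialise a record ID against the segment header's table of
    referenced segment IDs: a 1-byte index into that table followed by
    a 2-byte record offset. Hypothetical layout, 3 bytes total."""
    return struct.pack(">BH", segment_index, offset)

sid = uuid.uuid4()
print(len(encode_full(sid, 0x1234)))   # full form: 18 bytes
print(len(encode_short(7, 0x1234)))    # short form: 3 bytes
```

The short form is what forces the reference table into the segment header (and caps the number of referenced segments per segment at what the index byte can address); dropping it removes both the table and the cap, at the cost of six times the space per record ID.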