Hi,

Yesterday I took some time for a little experiment: how many
optimisations can be removed from the current segment format while
maintaining the same functionality?

I did some work in a branch on GitHub [1]. The code on that branch is
similar to the current trunk except for the following changes:

1. Record IDs are always serialised in their entirety. As such, a
serialised record ID occupies 18 bytes instead of 3.
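To make the size difference concrete, here is a minimal sketch of the
two encodings. The method names are made up, and the exact byte layout
is an assumption on my part: 18 bytes read as a 16-byte segment UUID
plus a 2-byte record offset, and 3 bytes as a 1-byte index into the
per-segment table of referenced segment IDs plus the same 2-byte
offset.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical sketch of the two record ID encodings; not the actual
// oak-segment-tar API, just the assumed byte layout.
public class RecordIdSize {

    // Full form: 16-byte segment UUID + 2-byte record offset = 18 bytes.
    static byte[] writeFull(UUID segmentId, short offset) {
        ByteBuffer buf = ByteBuffer.allocate(18);
        buf.putLong(segmentId.getMostSignificantBits());
        buf.putLong(segmentId.getLeastSignificantBits());
        buf.putShort(offset);
        return buf.array();
    }

    // Short form: 1-byte index into the segment's table of referenced
    // segment IDs + 2-byte record offset = 3 bytes.
    static byte[] writeShort(byte segmentIndex, short offset) {
        ByteBuffer buf = ByteBuffer.allocate(3);
        buf.put(segmentIndex);
        buf.putShort(offset);
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(writeFull(UUID.randomUUID(), (short) 0x1234).length); // 18
        System.out.println(writeShort((byte) 1, (short) 0x1234).length);         // 3
    }
}
```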

2. Because of the previous change, the table of referenced segment IDs
is no longer needed to resolve record IDs, so I removed it from the
segment header. It turns out, however, that this table is still needed
by the mark phase of compaction, so compaction is broken in that
branch.
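For context on why the mark phase wants that table, here is a rough
sketch (with entirely hypothetical names and types) of marking live
segments as a traversal of the segment reference graph. With a
per-segment table of referenced segment IDs the edges are available
directly; without it, they would have to be recovered by parsing every
record in every segment.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the mark phase: a plain reachability
// traversal over the graph of segment references.
public class MarkPhase {

    // referencedIds maps each segment ID to the segment IDs it
    // references -- exactly the table removed in the branch.
    static Set<String> mark(String root, Map<String, List<String>> referencedIds) {
        Set<String> live = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String id = stack.pop();
            if (live.add(id)) {
                stack.addAll(referencedIds.getOrDefault(id, List.of()));
            }
        }
        return live;
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = Map.of(
                "s0", List.of("s1", "s2"),
                "s1", List.of("s2"),
                "s3", List.of("s0")); // s3 is unreachable from s0
        System.out.println(mark("s0", refs)); // s0, s1 and s2 are live
    }
}
```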

Anyway, since the code is in a runnable state, I generated some
content using the current trunk and the dumber version of
oak-segment-tar. This is the repository created by the dumb
oak-segment-tar:

524744 data00000a.tar
524584 data00001a.tar
524688 data00002a.tar
460896 data00003a.tar
8 journal.log
0 repo.lock

This is the one created by the current trunk:

524864 data00000a.tar
524656 data00001a.tar
524792 data00002a.tar
297288 data00003a.tar
8 journal.log
0 repo.lock

The process that generates the content doesn't change between the two
executions, and the generated content is coming from a real world
scenario. For those familiar with it, the content is generated by an
installation of Adobe Experience Manager.

It looks like the size of the repository doesn't change much. The
per-record de-optimisation is probably dwarfed by the binary content.
Another effect of my change is that there is no longer a limit on the
number of referenced segment IDs per segment, which might allow
segments to pack more records than before.
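Summing the listings above (units as reported by the listing), the
de-optimised repository comes out roughly 8.7% larger. A throwaway
snippet with the numbers plugged in:

```java
// Quick arithmetic on the tar sizes listed above.
public class RepoSizes {

    static long sum(long... sizes) {
        long total = 0;
        for (long s : sizes) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        // Branch without the optimisations.
        long dumb = sum(524744, 524584, 524688, 460896);
        // Current trunk.
        long trunk = sum(524864, 524656, 524792, 297288);
        System.out.println(dumb);  // 2034912
        System.out.println(trunk); // 1871600
        System.out.printf("overhead: %.1f%%%n", 100.0 * (dumb - trunk) / trunk); // 8.7%
    }
}
```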

Open questions aside, the clear advantage of this change is a great
simplification of the code. I guess a few more lines could go, but
what I peeled off is already a considerable amount. Take a look at the
code!

Francesco

[1]: https://github.com/francescomari/jackrabbit-oak/tree/dumb
