Hi,

Cool! I'm pretty sure there are various ways in which the format could be
improved, as the original design was based mostly on intuition, guided
somewhat by collected stats <http://markmail.org/message/kxe3iy2hnodxsghe> and
the micro-benchmarks <https://issues.apache.org/jira/browse/OAK-119> used
to optimize common operations.

Note though that the total size of the repository was not and probably
shouldn't be a primary metric, since the size of a typical repository is
governed mostly by binaries and string properties (though it's a good idea
to make sure you avoid things like duplicates of large binaries). Instead
the rationale for squeezing things like record ids to as few bytes as
possible is captured in the principles listed in the original design doc
<http://jackrabbit.apache.org/oak/docs/nodestore/segmentmk.html>:

   - Compactness. The formatting of records is optimized for size to reduce
   IO costs and to fit as much content in caches as possible. A node stored in
   SegmentNodeStore typically consumes only a fraction of the size it would as
   a bundle in Jackrabbit Classic.
   - Locality. Segments are written so that related records, like a node
   and its immediate children, usually end up stored in the same segment. This
   makes tree traversals very fast and avoids most cache misses for typical
   clients that access more than one related node per session.

Thus I would recommend keeping an eye on benchmark results in addition to
raw repository size when evaluating possible improvements. The number and
size of data segments are also good metrics to look at beyond total disk
usage.

BR,

Jukka Zitting

On Fri, Jul 22, 2016 at 5:55 AM Francesco Mari <[email protected]>
wrote:

> The impact on repository size needs to be assessed with more specific
> tests. In particular, I found RecordUsageAnalyserTest and
> SegmentSizeTest unsuitable for this task. It's not a coincidence that
> these tests are usually the first to be disabled or blindly updated
> every time a small fix changes the size of the records.
>
> Regarding GC, the segment graph could be computed during the mark
> phase. Of course, it's handy to have this information pre-computed for
> you, but since the record graph is traversed anyway we could think
> about dynamically reconstructing the segment graph when needed.
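[To make the idea above concrete, here is a minimal sketch of rebuilding the segment graph from the record-level traversal. None of these types exist in Oak under these names; a record id is modelled as a (segment id, offset) pair as in the segment format, and the graph is derived by projecting record-to-record references onto their segments.]

```java
import java.util.*;

// Hypothetical sketch: derive the segment graph on the fly during the mark
// phase, assuming the traversal already yields every record-level reference.
public class SegmentGraphSketch {

    // A record id is a (segment id, offset) pair, as in the segment format.
    record RecordId(UUID segmentId, int offset) {}

    // Whenever a record in segment A references a record in segment B,
    // add a segment-level edge A -> B (intra-segment references are skipped).
    static Map<UUID, Set<UUID>> segmentGraph(Map<RecordId, List<RecordId>> recordRefs) {
        Map<UUID, Set<UUID>> graph = new HashMap<>();
        recordRefs.forEach((from, targets) -> {
            for (RecordId to : targets) {
                if (!from.segmentId().equals(to.segmentId())) {
                    graph.computeIfAbsent(from.segmentId(), k -> new HashSet<>())
                         .add(to.segmentId());
                }
            }
        });
        return graph;
    }

    public static void main(String[] args) {
        UUID a = UUID.randomUUID(), b = UUID.randomUUID();
        RecordId r1 = new RecordId(a, 0), r2 = new RecordId(a, 64), r3 = new RecordId(b, 0);
        // r1 -> r2 stays inside segment a; r2 -> r3 crosses from a to b.
        Map<RecordId, List<RecordId>> refs = Map.of(r1, List.of(r2), r2, List.of(r3));
        Map<UUID, Set<UUID>> g = segmentGraph(refs);
        System.out.println(g.get(a).contains(b)); // true
        System.out.println(g.containsKey(b));     // false: b has no outgoing edges
    }
}
```

[The trade-off is exactly the one stated above: the per-segment table pre-computes this map, while the sketch pays for a full record traversal each time the graph is needed.]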
>
> There are still so many questions to answer, but I think that this
> simplification exercise is worth the effort.
>
> 2016-07-22 11:34 GMT+02:00 Michael Dürig <[email protected]>:
> >
> > Hi,
> >
> > Neat! I would have expected a greater impact on the size of the segment
> > store. But as you say it probably all depends on the binary/content
> ratio. I
> > think we should look at the #references / repository size ratio for
> > repositories of different structures and see how such a number differs
> with
> > and without the patch.
> >
> > I like the patch as it fixes OAK-2896 while at the same time reducing
> > complexity a lot.
> >
> > OTOH we need to figure out how to regain the lost functionality (e.g. gc)
> > and assess its impact on repository size.
> >
> > Michael
> >
> >
> >
> > On 22.7.16 11:32, Francesco Mari wrote:
> >>
> >> Hi,
> >>
> >> Yesterday I took some time for a little experiment: how many
> >> optimisations can be removed from the current segment format while
> >> maintaining the same functionality?
> >>
> >> I did some work in a branch on GitHub [1]. The code on that branch is
> >> similar to the current trunk except for the following changes:
> >>
> >> 1. Record IDs are always serialised in their entirety. As such, a
> >> serialised record ID occupies 18 bytes instead of 3.
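[A sketch of the two serialisations being compared. Only the totals, 18 versus 3 bytes, come from this thread; the field layout below (16-byte segment UUID plus 2-byte offset for the full form, 1-byte index into the segment's table of referenced segment ids plus 2-byte offset for the short form) is an assumption for illustration.]

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative only: the exact field layout is assumed, not taken from Oak.
public class RecordIdLayout {

    // Full form: 16-byte segment UUID + 2-byte offset = 18 bytes.
    static byte[] full(UUID segmentId, short offset) {
        return ByteBuffer.allocate(18)
                .putLong(segmentId.getMostSignificantBits())
                .putLong(segmentId.getLeastSignificantBits())
                .putShort(offset)
                .array();
    }

    // Short form: 1-byte index into the table of referenced segment ids
    // + 2-byte offset = 3 bytes.
    static byte[] shortForm(byte segmentIndex, short offset) {
        return ByteBuffer.allocate(3)
                .put(segmentIndex)
                .putShort(offset)
                .array();
    }

    public static void main(String[] args) {
        System.out.println(full(UUID.randomUUID(), (short) 128).length); // 18
        System.out.println(shortForm((byte) 1, (short) 128).length);     // 3
    }
}
```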
> >>
> >> 2. Because of the previous change, the table of referenced segment IDs
> >> is not needed anymore, so I removed it from the segment header. It
> >> turns out that this table is indeed needed for the mark phase of
> >> compaction, so this feature is broken in that branch.
> >>
> >> Anyway, since the code is in a runnable state, I generated some
> >> content using the current trunk and the dumber version of
> >> oak-segment-tar. This is the repository created by the dumb
> >> oak-segment-tar:
> >>
> >> 524744 data00000a.tar
> >> 524584 data00001a.tar
> >> 524688 data00002a.tar
> >> 460896 data00003a.tar
> >> 8 journal.log
> >> 0 repo.lock
> >>
> >> This is the one created by the current trunk:
> >>
> >> 524864 data00000a.tar
> >> 524656 data00001a.tar
> >> 524792 data00002a.tar
> >> 297288 data00003a.tar
> >> 8 journal.log
> >> 0 repo.lock
> >>
> >> The process that generates the content doesn't change between the two
> >> executions, and the generated content is coming from a real world
> >> scenario. For those familiar with it, the content is generated by an
> >> installation of Adobe Experience Manager.
> >>
> >> It looks like the size of the repository doesn't change much.
> >> Probably the de-optimisation in the small is dwarfed by the binary
> >> content in the large. Another effect of my change is that there is no
> >> limit on the number of referenced segment IDs per segment, and this
> >> might allow segments to pack more records than before.
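[A back-of-the-envelope comparison of the two encodings, using the sizes from this thread (3 versus 18 bytes per reference) and assuming each entry in the removed table costs 16 bytes, one segment UUID. With k references from a segment into one other segment, the table amortises as soon as k reaches 2.]

```java
// Illustrative arithmetic only; the 16-byte table entry is an assumption.
public class ReferenceCost {

    // Cost of k references from one segment into one other segment.
    static int shortFormCost(int k) { return 16 + 3 * k; } // table entry + short ids
    static int fullFormCost(int k)  { return 18 * k; }     // full ids only

    public static void main(String[] args) {
        for (int k = 1; k <= 3; k++) {
            System.out.printf("k=%d short=%d full=%d%n",
                    k, shortFormCost(k), fullFormCost(k));
        }
        // k=1: 19 vs 18 bytes - the full form is actually smaller.
        // k=2: 22 vs 36 bytes - the reference table starts paying off.
    }
}
```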
> >>
> >> Questions apart, the clear advantage of this change is a great
> >> simplification of the code. I guess I can remove some more lines, but
> >> what I peeled off is already a considerable amount. Look at the code!
> >>
> >> Francesco
> >>
> >> [1]: https://github.com/francescomari/jackrabbit-oak/tree/dumb
> >>
> >
>
