Re: SegmentMK OffsetCache

Jukka Zitting Tue, 17 Sep 2013 07:32:04 -0700

Hi,

On Tue, Sep 17, 2013 at 5:45 AM, Michael Dürig <[email protected]> wrote:
> While looking into OAK-1019 I was wondering whether the internal data
> structure for the OffsetCache is actually the right choice. The arrays,
> which are used to hold the offsets and the values will always grow but never
> shrink. Although the values are soft referenced, their reference will still
> take up its corresponding array slot after a gc.


The OffsetCaches are always associated with and referenced by Segment
instances, which in most cases take up much more memory, so I'm not
too worried about the memory overhead. The size of the OffsetCache
arrays is bounded by the number of records that can possibly fit
inside a single Segment (at least after the fix you found! :-).

> Do we have an idea what the chances are such slots are reused? I presume the
> offsets are not uniquely distributed across the whole offset space.
> Otherwise there is a high chance of accumulating lots of unoccupied slots in
> these arrays. A sparse array implementation might be a better choice in this
> case.

It's already a sparse array, as only those offsets that have already
been accessed are present. In general my assumption here is that if
say a string value is accessed, it's highly likely that it gets
accessed again in the near future. If it doesn't (for example if someone
is doing a scan of the repository), it's also likely that nothing else
in the same segment gets accessed again, in which case the whole
OffsetCache structure would end up garbage collected.
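
To make the discussion concrete, here is a minimal sketch of the kind of
structure being described: a sorted array of accessed offsets with a
parallel array of soft-referenced values. This is not the actual Oak code;
the class name OffsetCacheSketch and the methods get, put and slots are
hypothetical and chosen only for illustration.

```java
import java.lang.ref.SoftReference;
import java.util.Arrays;

/** Hypothetical sketch only; not the actual Oak OffsetCache. */
class OffsetCacheSketch<T> {

    // Sorted offsets of the records that have been accessed so far.
    private int[] offsets = new int[0];

    // values[i] softly references the value cached for offsets[i].
    @SuppressWarnings("unchecked")
    private SoftReference<T>[] values = new SoftReference[0];

    /** Returns the cached value, or null if absent or cleared by the gc. */
    synchronized T get(int offset) {
        int i = Arrays.binarySearch(offsets, offset);
        return i >= 0 ? values[i].get() : null;
    }

    /** Caches a value, reusing the slot if the offset is already present. */
    @SuppressWarnings("unchecked")
    synchronized void put(int offset, T value) {
        int i = Arrays.binarySearch(offsets, offset);
        if (i >= 0) {
            // Known offset: the existing slot is reused, no growth.
            values[i] = new SoftReference<>(value);
            return;
        }
        // New offset: grow both arrays by one and insert in sorted order.
        int at = -(i + 1);
        int n = offsets.length;
        int[] newOffsets = new int[n + 1];
        SoftReference<T>[] newValues = new SoftReference[n + 1];
        System.arraycopy(offsets, 0, newOffsets, 0, at);
        System.arraycopy(values, 0, newValues, 0, at);
        newOffsets[at] = offset;
        newValues[at] = new SoftReference<>(value);
        System.arraycopy(offsets, at, newOffsets, at + 1, n - at);
        System.arraycopy(values, at, newValues, at + 1, n - at);
        offsets = newOffsets;
        values = newValues;
    }

    /** Number of occupied slots; in this sketch it never shrinks. */
    synchronized int slots() {
        return offsets.length;
    }
}
```

In this sketch only accessed offsets take up a slot, which is the sense in
which the structure is sparse. A slot whose value has been cleared by the
gc still stays in the arrays until the whole cache object becomes
unreachable, which is the overhead Michael is asking about.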

That said, I'm not bound to this particular implementation or design.
It's just a simple solution I came up with by intuition rather than through
more specific planning or benchmarks. As we start to focus more on
optimizing these lower layers, I expect there to be a lot of room for
improvement, both in design and implementation.

BR,

Jukka Zitting
