Hi,

On Tue, Sep 17, 2013 at 5:45 AM, Michael Dürig <[email protected]> wrote:
> While looking into a OAK-1019 I was wondering whether the internal data
> structure for the OffsetCache is actually the right choice. The arrays,
> which are used to hold the offsets and the values will always grow but never
> shrink. Although the values are soft referenced, their reference will still
> take up its corresponding array slot after a gc.

The OffsetCaches are always associated with and referenced by Segment
instances, which in most cases take up much more memory, so I'm not too
worried about the memory overhead. The size of the OffsetCache arrays is
bounded by the number of records that can possibly fit inside a single
Segment (at least after the fix you found! :-).

> Do we have an idea what the chances are such slots are reused? I presume the
> offsets are not uniquely distributed across the whole offset space.
> Otherwise there is a high chance of accumulating lots of unoccupied slots in
> these arrays. A sparse array implementation might be a better choice in this
> case.

It's already a sparse array, as only those offsets that have already been
accessed are present.

In general my assumption here is that if, say, a string value is accessed,
it's highly likely that it gets accessed again in the near future. If it
doesn't (for example if someone is doing a scan of the repository), it's
also likely that nothing else in the same segment gets accessed again, in
which case the whole OffsetCache structure would end up garbage collected.

That said, I'm not bound to this particular implementation or design; it's
just a simple solution I came up with by intuition rather than through more
specific planning or benchmarks. As we start to focus more on optimizing
these lower layers, I expect there to be a lot of room for improvement,
both in design and implementation.

BR,

Jukka Zitting
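
For readers following the thread, the structure under discussion can be
sketched roughly as below. This is a hypothetical illustration, not the
actual Oak OffsetCache: the class name OffsetCacheSketch, the get() and
slots() methods, and the loader parameter are all assumptions made for the
example. It shows the two properties mentioned above: only offsets that
have been accessed get a slot, and a slot stays in the arrays even after
the garbage collector has cleared its soft-referenced value.

```java
import java.lang.ref.SoftReference;
import java.util.Arrays;
import java.util.function.IntFunction;

/** Hypothetical sketch of a per-segment offset cache; not Oak's real code. */
final class OffsetCacheSketch<T> {
    // Sorted offsets that have been accessed so far (sparse: nothing else is stored).
    private int[] offsets = new int[0];
    // values[i] softly references the value read at offsets[i].
    @SuppressWarnings("unchecked")
    private SoftReference<T>[] values = new SoftReference[0];

    /** Returns the cached value for the offset, loading it on a miss. */
    public synchronized T get(int offset, IntFunction<T> loader) {
        int i = Arrays.binarySearch(offsets, offset);
        if (i >= 0) {
            T cached = values[i].get();
            if (cached != null) {
                return cached;
            }
            // The value was collected, but its slot is still there: refill it.
            T reloaded = loader.apply(offset);
            values[i] = new SoftReference<>(reloaded);
            return reloaded;
        }
        // First access to this offset: grow both arrays by one sorted slot.
        int at = -(i + 1);
        T loaded = loader.apply(offset);
        int[] newOffsets = new int[offsets.length + 1];
        SoftReference<T>[] newValues = Arrays.copyOf(values, values.length + 1);
        System.arraycopy(offsets, 0, newOffsets, 0, at);
        System.arraycopy(offsets, at, newOffsets, at + 1, offsets.length - at);
        System.arraycopy(values, at, newValues, at + 1, values.length - at);
        newOffsets[at] = offset;
        newValues[at] = new SoftReference<>(loaded);
        offsets = newOffsets;
        values = newValues;
        return loaded;
    }

    /** Number of slots in use; grows on first access and never shrinks. */
    public synchronized int slots() {
        return offsets.length;
    }
}
```

In this sketch the arrays never shrink, which is the concern raised in the
quoted message. The number of slots is still capped by the number of
distinct offsets accessed, and the whole object becomes collectable once
nothing references the owning segment.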
