[
https://issues.apache.org/jira/browse/OAK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190912#comment-15190912
]
Michael Dürig commented on OAK-3348:
------------------------------------
After much research I think I finally came up with a way to get rid of references to
pre-compacted segments. Please have a look at my respective [GitHub
branch|https://github.com/mduerig/jackrabbit-oak/tree/OAK-3348.poc].
There is a *HUGE* amount of changes, around 10 kloc as a
[patch|https://github.com/apache/jackrabbit-oak/compare/461f113c7bf7bec3fa3626cca84db6619e5fc372...c269193ee4b824b62c2d03e9f097b3a0dc0e61df.patch].
I structured the changes into various commits grouping related concerns, as summarised in the
individual commit messages. Most of the time I resisted unrelated refactorings so as not to
complicate things further; instead I just left [FIXME
michid|https://github.com/mduerig/jackrabbit-oak/commit/c269193ee4b824b62c2d03e9f097b3a0dc0e61df]
tags in the code. However, sometimes I had to refactor things in order to get a
clearer picture. I also probably broke some things not immediately concerned (like
backup, restore, standby, upgrade and tooling). I left the same tags in the code at
those places I knew would need following up.
The core of this patch comes from the realisation that the compaction map serves three
different concerns: deduplication, equality (of node states across different GC generations) and
tracking GC generations. Furthermore, for equality and GC generations the compaction map
served as an external structure that had to be kept in sync with the records and segments managed
elsewhere. This resulted in overly complex coordination and synchronisation of global state,
which was nearly impossible to get right. In my patch I completely separate those three concerns,
allowing me to get rid of the compaction map entirely.
* The GC generation is now written into the segment header and segments can only contain records
pertaining to that generation. The current generation is then simply the generation of the
segment containing the current head node state. As a result, incrementing the generation is now
implicit with {{SegmentStore.setHead}} and thus free of races; it happens when the result of
the {{Compactor}} is applied. See the first sketch after this list.
* Equality of segment node states is now id based: each {{SegmentNodeState}} contains an id, and
equality of the ids is a necessary condition for equality of those node states. Again, as this id
is persisted along with the states, we don't have to rely on and maintain external structures
(like the compaction map) that are prone to races.
* As a direct consequence of not having to rely on an external map for equality, we can tweak
the map we need for deduplication: there is no need for fully persistent keys anymore and we
can use a cache instead. This has the advantage that we can specifically tweak the cache to
this single use case and adapt it to the available memory, trading off deduplication
efficiency against memory. Deduplication is done in the {{SegmentWriter}}, where I added
a special cache for node states that gives precedence to nodes high up in the tree.
See the second sketch after this list.
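To make the first two points more concrete, here is a minimal sketch in plain Java. The class and method names are simplified, hypothetical stand-ins rather than the actual oak-segment-tar types: the GC generation lives in the segment header, the current generation is derived from the segment containing the head state, and node state equality is decided by a persisted id.
{code:java}
// Sketch only: simplified, hypothetical types illustrating the idea of storing the
// GC generation in the segment header and deriving the current generation from the
// segment containing the head state. Not the actual Oak API.
import java.util.concurrent.atomic.AtomicReference;

final class Segment {
    private final int gcGeneration;           // written into the segment header

    Segment(int gcGeneration) {
        this.gcGeneration = gcGeneration;
    }

    int getGcGeneration() {
        return gcGeneration;
    }
}

final class SegmentNodeState {
    private final Segment segment;            // segment containing this record
    private final long stableId;              // persisted along with the node state

    SegmentNodeState(Segment segment, long stableId) {
        this.segment = segment;
        this.stableId = stableId;
    }

    int getGcGeneration() {
        return segment.getGcGeneration();
    }

    // Equal stable ids are a necessary condition for equality of node states,
    // so no external compaction map is needed for cross-generation comparisons.
    @Override
    public boolean equals(Object other) {
        return other instanceof SegmentNodeState
                && ((SegmentNodeState) other).stableId == stableId;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(stableId);
    }
}

final class SegmentStore {
    private final AtomicReference<SegmentNodeState> head = new AtomicReference<>();

    // The current GC generation is simply the generation of the segment that
    // contains the current head. Setting a compacted head therefore increments
    // the generation implicitly and atomically; there is no separate counter
    // to keep in sync.
    int getCurrentGcGeneration() {
        SegmentNodeState h = head.get();
        return h == null ? 0 : h.getGcGeneration();
    }

    boolean setHead(SegmentNodeState expected, SegmentNodeState newHead) {
        return head.compareAndSet(expected, newHead);
    }
}
{code}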
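And a sketch of the kind of node deduplication cache hinted at in the last point: a bounded map that evicts the deepest entries first, so that nodes high up in the tree are retained. The sizing and eviction policy shown here are illustrative assumptions only, not the structure used in the patch.
{code:java}
// Sketch of a deduplication cache giving precedence to nodes high up in the tree
// (small depth). Names and eviction policy are illustrative only.
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class NodeDeduplicationCache<K, V> {
    private static final class Entry<K> {
        final K key;
        final int depth;   // depth of the node in the tree; lower = closer to the root

        Entry(K key, int depth) {
            this.key = key;
            this.depth = depth;
        }
    }

    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    // Deepest entries are evicted first, so shallow (more valuable) nodes survive.
    private final PriorityQueue<Entry<K>> evictionOrder =
            new PriorityQueue<>((a, b) -> Integer.compare(b.depth, a.depth));

    NodeDeduplicationCache(int capacity) {
        this.capacity = capacity;
    }

    synchronized V get(K key) {
        return values.get(key);
    }

    synchronized void put(K key, V value, int depth) {
        if (values.containsKey(key)) {
            values.put(key, value);
            return;
        }
        if (values.size() >= capacity) {
            Entry<K> deepest = evictionOrder.poll();
            if (deepest != null) {
                values.remove(deepest.key);
            }
        }
        values.put(key, value);
        evictionOrder.add(new Entry<>(key, depth));
    }
}
{code}
The idea is that the {{SegmentWriter}} consults such a cache before writing a node and records the mapping afterwards, together with the node's depth.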
The patch contains quite a number of [FIXME
michid|https://github.com/mduerig/jackrabbit-oak/commit/c269193ee4b824b62c2d03e9f097b3a0dc0e61df]
tags for things that need fixing/evaluating/analysing/improving. Most prominently:
* We need a good mechanism for creating ids for segment node states. Currently I use a counter,
which I consider good enough for the POC but not fit for production (see the first sketch
after this list).
* What is the effect of those additional ids on repository size?
* Having to read the GC generation from the segment potentially adds some overhead, as we now need
to map the segment into memory where before it was sufficient to look at its storage address.
Maybe we can write the generation into the tar file index!?
* Validate and optimise the segment node state deduplication cache. What sizing works? Do we
also need to take the number of child nodes into account or is the position in the hierarchy
sufficient? Do we need pinning (for checkpoints)?
* Do we need a cache for de-duplicating binaries? We didn't have this so far except in
the compactor. As the latter now relies on deduplication done in the segment writer, this is
something we should look into. OTOH, I improved deduplication for large binaries by rewriting
the list of block ids when copying a segment stream (see the second sketch after this list).
As we should recommend using a data store for working with large binaries, this could actually
be sufficient.
* I particularly dislike the way the deduplication cache is handed from the compactor to the
segment writer. We should refactor this part into a clean solution. Also, cache eviction is
currently handled from outside (for all record caches actually); it would be cleaner to do
this from inside the caches themselves.
* We need good monitoring for these caches.
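To illustrate the first point above: the POC's counter based ids boil down to something like the following. This is simplified, hypothetical code, not the actual implementation.
{code:java}
// Sketch of the POC approach: stable ids for segment node states handed out from a
// simple counter. Good enough to demonstrate the concept, but (as noted) not fit for
// production, e.g. it does not survive concurrent writers in separate processes.
import java.util.concurrent.atomic.AtomicLong;

final class StableIdSource {
    private final AtomicLong counter = new AtomicLong();

    // Each newly written node state gets the next value; the id is then
    // persisted together with the node state record.
    long nextStableId() {
        return counter.incrementAndGet();
    }
}
{code}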
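And a hedged sketch of the block id rewriting for large binaries mentioned above; all types here are simplified stand-ins for illustration, the actual {{SegmentWriter}} code differs.
{code:java}
// Sketch: when a segment stream (a list of block record ids) is copied to the
// current generation, each block is written at most once and the stream record is
// rebuilt from the mapped ids, so shared blocks deduplicate naturally.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StreamCopier {
    interface BlockStore {
        byte[] readBlock(String blockId);    // read from the old generation
        String writeBlock(byte[] data);      // write into the current generation
    }

    private final BlockStore store;
    private final Map<String, String> copiedBlocks = new HashMap<>();

    StreamCopier(BlockStore store) {
        this.store = store;
    }

    // Rewrite the list of block ids of a large binary: blocks already copied are
    // reused instead of being written again.
    List<String> copyStream(List<String> oldBlockIds) {
        List<String> newBlockIds = new ArrayList<>(oldBlockIds.size());
        for (String oldId : oldBlockIds) {
            String newId = copiedBlocks.computeIfAbsent(
                    oldId, id -> store.writeBlock(store.readBlock(id)));
            newBlockIds.add(newId);
        }
        return newBlockIds;
    }
}
{code}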
Further down the line we also need to think about upgrading existing repositories, as obviously
there are changes to the segment format here.
> Cross gc sessions might introduce references to pre-compacted segments
> ----------------------------------------------------------------------
>
> Key: OAK-3348
> URL: https://issues.apache.org/jira/browse/OAK-3348
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: segmentmk
> Reporter: Michael Dürig
> Assignee: Michael Dürig
> Labels: candidate_oak_1_0, candidate_oak_1_2, cleanup,
> compaction, gc
> Fix For: 1.6
>
> Attachments: OAK-3348-1.patch, OAK-3348-2.patch, OAK-3348.patch,
> cross-gc-refs.pdf, image.png
>
>
> I suspect that certain write operations during compaction can cause
> references from compacted segments to pre-compacted ones. This would
> effectively prevent the pre-compacted segments from getting evicted in
> subsequent cleanup phases.
> The scenario is as follows:
> * A session is opened and a lot of content is written to it such that the
> update limit is exceeded. This causes the changes to be written to disk.
> * Revision gc runs causing a new, compacted root node state to be written to
> disk.
> * The session saves its changes. This causes rebasing of its changes onto the
> current root (the compacted one). At this point any node that has been added
> will be added again in the sub-tree rooted at the current root. Such nodes
> however might have been written to disk *before* revision gc ran and might
> thus be contained in pre-compacted segments. As I suspect that the node-add
> operation in the rebasing process does *not* create a deep copy of such nodes
> but rather creates a *reference* to them, a reference to a pre-compacted
> segment is introduced here.
> Going forward we need to validate the above hypothesis, assess its impact and, if
> necessary, come up with a solution.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)