[ https://issues.apache.org/jira/browse/OAK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190912#comment-15190912 ]

Michael Dürig commented on OAK-3348:
------------------------------------

After much research I think I finally came up with a way to get rid of references to pre-compacted segments. Please have a look at my respective [GitHub branch|https://github.com/mduerig/jackrabbit-oak/tree/OAK-3348.poc].

There is a *HUGE* amount of changes, around 10 kloc as a [patch|https://github.com/apache/jackrabbit-oak/compare/461f113c7bf7bec3fa3626cca84db6619e5fc372...c269193ee4b824b62c2d03e9f097b3a0dc0e61df.patch].

I structured the changes into various commits grouping related concerns, as summarised in the individual commit messages. Most of the time I resisted unrelated refactorings so as not to complicate things further. Instead I just left [FIXME michid|https://github.com/mduerig/jackrabbit-oak/commit/c269193ee4b824b62c2d03e9f097b3a0dc0e61df] tags in the code. However, sometimes I just had to refactor stuff in order to get a clearer picture. Also I probably broke some things not immediately concerned (like upgrade, backup, restore, standby and tooling). I left the same tags in the code for those places I knew would need following up.

The core of this patch comes from the realisation that the compaction map serves three different concerns: deduplication, equality (of node states across different GC generations) and tracking GC generations. Furthermore, for equality and GC generations the compaction map served as an external structure that had to be kept in sync with the records and segments managed elsewhere. This resulted in overly complex coordination and synchronisation of global state, which was nearly impossible to get right. In my patch I completely separate those three concerns, allowing me to get rid of the compaction map entirely.

* The GC generation is now written into the segment header and segments can only contain records pertaining to that generation. The current generation is then simply the generation of the segment containing the current head node state. As a result, incrementing the generation is now implicit with {{SegmentStore.setHead}} and thus free of races: it happens when the result of the {{Compactor}} is applied (see the first sketch after this list).

* Equality of segment node states is now id-based: each {{SegmentNodeState}} contains an id. Equality of the id is a necessary condition for equality of those node states. Again, as this id is persisted along with the states, we don't have to rely on and maintain external structures (like the compaction map) that are prone to races (see the second sketch after this list).

* As a direct consequence of not having to rely on an external map for equality, we can tweak the map we need for deduplication: there is no need for fully persistent keys anymore and we can use a cache instead. This has the advantage that we can tailor the cache specifically to this single use case and adapt it to the available memory, trading off deduplication efficiency against memory consumption. Deduplication is done in the {{SegmentWriter}}, where I added a special cache for node states that gives precedence to nodes high up in the tree (see the third sketch after this list).
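
To make the first point more concrete, here is a minimal sketch. All class and method names below are invented for illustration and do not correspond to the actual code in the branch; the point is merely that the current generation falls out of the head segment, so the generation bump is implicit in the head swap:

{code:java}
// Hypothetical sketch: the GC generation lives in the segment header, and the
// "current" generation is simply the generation of the segment containing the
// current head node state.
final class Segment {
    private final int gcGeneration;   // persisted in the segment header

    Segment(int gcGeneration) {
        this.gcGeneration = gcGeneration;
    }

    int getGcGeneration() {
        return gcGeneration;
    }
}

final class NodeState {
    private final Segment segment;    // segment holding this node state's record

    NodeState(Segment segment) {
        this.segment = segment;
    }

    Segment getSegment() {
        return segment;
    }
}

final class Store {
    private volatile NodeState head;

    Store(NodeState initialHead) {
        this.head = initialHead;
    }

    // No external generation counter that has to be kept in sync with anything.
    int getCurrentGcGeneration() {
        return head.getSegment().getGcGeneration();
    }

    // Installing the compactor's result atomically makes its (newer) generation
    // the current one; the increment is implicit and free of races.
    synchronized boolean setHead(NodeState expected, NodeState compacted) {
        if (head != expected) {
            return false;   // lost the race; the caller has to rebase and retry
        }
        head = compacted;
        return true;
    }
}
{code}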
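
For the second point, a rough sketch of what id-based equality could look like. The id type and the fast-path semantics are my assumptions here (in particular that the compactor writes the compacted copy of a node under the same id), not necessarily what the branch implements:

{code:java}
import java.util.UUID;

// Hypothetical base class: the id is persisted together with the node state, so
// no external structure (like the compaction map) needs to be kept in sync.
abstract class IdentifiedNodeState {
    private final UUID id;   // written alongside the node state's record

    IdentifiedNodeState(UUID id) {
        this.id = id;
    }

    UUID getId() {
        return id;
    }

    // Cheap check across GC generations: assuming the compacted copy of a node
    // keeps the id of its pre-compaction source, matching ids identify the same
    // logical node state; on a mismatch nothing is decided and callers fall back
    // to a full content comparison.
    boolean fastEquals(IdentifiedNodeState other) {
        return id.equals(other.getId());
    }
}
{code}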
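
And for the third point, a sketch of a deduplication cache that prefers nodes high up in the tree. The depth-bucketed eviction policy is just one possible reading of "precedence to nodes high up in the tree", not the actual cache from the branch:

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Bounded deduplication cache (hypothetical): lookups are by key only, while
// eviction drops the entries furthest from the root first, so nodes high up in
// the tree survive the longest.
final class NodeDeduplicationCache<K, V> {
    private final int capacity;
    private final Map<K, V> lookup = new HashMap<>();
    // depth -> keys of the nodes cached at that depth
    private final TreeMap<Integer, Set<K>> byDepth = new TreeMap<>();

    NodeDeduplicationCache(int capacity) {
        this.capacity = capacity;
    }

    synchronized V get(K key) {
        return lookup.get(key);
    }

    synchronized void put(K key, V value, int depth) {
        while (lookup.size() >= capacity && evictDeepest()) {
            // keep evicting the deepest bucket until there is room again
        }
        lookup.put(key, value);
        byDepth.computeIfAbsent(depth, d -> new HashSet<>()).add(key);
    }

    // Deduplicating a node close to the root avoids rewriting its whole subtree,
    // so those entries are the most valuable and are evicted last.
    private boolean evictDeepest() {
        Map.Entry<Integer, Set<K>> deepest = byDepth.pollLastEntry();
        if (deepest == null) {
            return false;
        }
        deepest.getValue().forEach(lookup::remove);
        return true;
    }
}
{code}

The writer would pass the depth of the node below the root when caching it, so the root and its immediate children are the last entries to fall out of the cache.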

The patch contains quite a number of [FIXME 
michid|https://github.com/mduerig/jackrabbit-oak/commit/c269193ee4b824b62c2d03e9f097b3a0dc0e61df]
tags for things that need fixing/evaluating/analysing/improving. Most 
prominently:

* We need a good mechanism for creating ids for segment node states. Currently 
I use a counter,
  which I consider good enough for the POC but not fit for production.

* What is the effect of those additional ids on repository size?

* Having to read the GC generation from the segment potentially adds some overhead, as we now need to map the segment into memory where before it was sufficient to look at its storage address. Maybe we can write the generation into the tar file index!?

* Validate and optimise the segment node state deduplication cache. What sizing works? Do we also need to take the number of child nodes into account or is the position in the hierarchy sufficient? Do we need pinning (for checkpoints)?

* Do we need a cache for de-duplicating binaries? We didn't have this so far except in the compactor. As the latter now relies on deduplication done in the segment writer, this is something we should look into. OTOH, I improved deduplication for large binaries by rewriting the list of block ids when copying a segment stream. As we should recommend using a data store for working with large binaries, this could actually be sufficient.

* I particularly dislike the way the deduplication cache is handed from the compactor to the segment writer. We should refactor this part into a clean solution. Also, cache eviction is currently handled from outside (for all record caches actually); it would be cleaner to do this from inside the caches themselves.

* We need good monitoring for these caches.

Further down the line we also need to think about upgrading existing repositories, as obviously there are changes to the segment format here.






> Cross gc sessions might introduce references to pre-compacted segments
> ----------------------------------------------------------------------
>
>                 Key: OAK-3348
>                 URL: https://issues.apache.org/jira/browse/OAK-3348
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: segmentmk
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>              Labels: candidate_oak_1_0, candidate_oak_1_2, cleanup, 
> compaction, gc
>             Fix For: 1.6
>
>         Attachments: OAK-3348-1.patch, OAK-3348-2.patch, OAK-3348.patch, 
> cross-gc-refs.pdf, image.png
>
>
> I suspect that certain write operations during compaction can cause 
> references from compacted segments to pre-compacted ones. This would 
> effectively prevent the pre-compacted segments from getting evicted in 
> subsequent cleanup phases. 
> The scenario is as follows:
> * A session is opened and a lot of content is written to it such that the 
> update limit is exceeded. This causes the changes to be written to disk. 
> * Revision gc runs causing a new, compacted root node state to be written to 
> disk.
> * The session saves its changes. This causes rebasing of its changes onto the 
> current root (the compacted one). At this point any node that has been added 
> will be added again in the sub-tree rooted at the current root. Such nodes 
> however might have been written to disk *before* revision gc ran and might 
> thus be contained in pre-compacted segments. As I suspect that the node-add 
> operation in the rebasing process does *not* create a deep copy of such nodes 
> but rather creates a *reference* to them, a reference to a pre-compacted 
> segment is introduced here. 
> Going forward we need to validate the above hypothesis, assess its impact and, 
> if necessary, come up with a solution.



