[
https://issues.apache.org/jira/browse/OAK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310490#comment-15310490
]
Michael Dürig commented on OAK-4279:
------------------------------------
Some food for thought:
* Do we need to de-duplicate binaries through {{Map<String, List<RecordId>>
binaries}}? After all, there is some de-duplication already happening in
{{SegmentWriter.SegmentWriteOperation#writeStream}}. That method tries to
extract the list of ids of the bulk segments of a stream and rewrites just that
list instead of all the bulks. Obviously this doesn't help if the same blob has
already been persisted more than once into different bulks. However, I wonder
how often this actually happens in practice. At least for large blobs I think
it is very unlikely. So maybe we should do the extra de-duplication in the
compactor only for small blobs, shaving off quite a bit of IO and CPU cycles
currently spent fully reading all those blobs in order to compare them.
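To make the small-blobs-only idea concrete, here is a minimal sketch of what such a size-gated de-duplication could look like. The class and method names ({{BinaryDeduper}}, {{deduplicate}}) and the 16 KiB threshold are purely illustrative assumptions, not Oak's actual compactor API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: de-duplicate only small binaries during compaction.
// Large blobs skip the content comparison entirely and rely on the existing
// bulk-id list rewriting in SegmentWriteOperation#writeStream.
class BinaryDeduper {

    // Assumed cut-off below which reading and comparing content is cheap.
    static final long SMALL_BLOB_THRESHOLD = 16 * 1024; // 16 KiB

    // content hash -> record id of the first copy we saw
    private final Map<String, String> smallBinaries = new HashMap<>();

    /**
     * Returns the record id of a previously written identical small blob,
     * or null if the caller should write this blob (it is either large or
     * the first copy of its content seen so far).
     */
    String deduplicate(String contentHash, long length, String recordId) {
        if (length > SMALL_BLOB_THRESHOLD) {
            return null; // large blob: not worth the extra IO/CPU
        }
        // Record the first copy; return the earlier id on a duplicate.
        return smallBinaries.putIfAbsent(contentHash, recordId);
    }
}
```

With such a gate, the compactor would pay the read-and-compare cost only for blobs small enough that the cost is negligible.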
* I'm also not sure we need the explicit {{RecordCache}} instance for node
states. The {{WriterCacheManager}} of the segment writer already has a
de-duplication cache for nodes ({{NodeCache}}). If this doesn't work for the
offline compaction case, we should at least disable that cache so it doesn't
consume extra memory for nothing.
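"Disabling" the cache could be as simple as substituting a no-op implementation that never retains anything. The interface below is a hypothetical stand-in for the real cache contract, just to illustrate the idea:

```java
// Illustrative only: a no-op de-duplication cache for the offline case.
// The DedupCache interface is an assumed stand-in, not Oak's actual
// WriterCacheManager/NodeCache API.
interface DedupCache<K, V> {
    V get(K key);
    void put(K key, V value);
}

final class NoopDedupCache<K, V> implements DedupCache<K, V> {
    @Override
    public V get(K key) {
        return null; // never a cache hit
    }

    @Override
    public void put(K key, V value) {
        // intentionally empty: retains nothing, so no memory overhead
    }
}
```

Wiring a no-op cache in for offline compaction would keep the code path uniform while guaranteeing zero memory cost when de-duplication cannot help.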
* Just noted that binaries are also put into that {{RecordCache}}. I think this
is definitely not necessary, as those are already de-duplicated via the
mechanism described in the first point.
* Do we still need {{Compactor#setContentEqualityCheck}} and related?
> Rework offline compaction
> -------------------------
>
> Key: OAK-4279
> URL: https://issues.apache.org/jira/browse/OAK-4279
> Project: Jackrabbit Oak
> Issue Type: Task
> Components: segment-tar
> Reporter: Michael Dürig
> Assignee: Alex Parvulescu
> Priority: Blocker
> Labels: compaction, gc
> Fix For: 1.6
>
> Attachments: OAK-4279-v0.patch, OAK-4279-v1.patch, OAK-4279-v2.patch,
> OAK-4279-v3.patch, OAK-4279-v4.patch
>
>
> The fix for OAK-3348 broke some of the previous functionality of offline
> compaction:
> * No more progress logging
> * Compaction is not interruptible any more (in the sense of OAK-3290)
> * Offline compaction could remove the ids of the segment node states to
> squeeze out some extra space. Those are only needed for later generations
> generated via online compaction.
> We should probably implement offline compaction again through a dedicated
> {{Compactor}} class as it was done in {{oak-segment}} instead of relying on
> the de-duplication cache (aka online compaction).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)