[ 
https://issues.apache.org/jira/browse/OAK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310490#comment-15310490
 ] 

Michael Dürig commented on OAK-4279:
------------------------------------

Some food for thought: 

* Do we need to de-duplicate binaries through {{Map<String, List<RecordId>> 
binaries}}? After all, some de-duplication already happens in 
{{SegmentWriter.SegmentWriteOperation#writeStream}}: that method tries to 
extract the list of ids of the bulk segments of a stream and rewrites just 
that list instead of all the bulk segments. Obviously this doesn't help if the 
same blob has already been persisted more than once into different bulk 
segments. However, I wonder how often that actually happens in practice; I 
think it is very unlikely for large blobs. So maybe the compactor should do 
the extra de-duplication only for small blobs, shaving off quite a bit of the 
IO and CPU cycles now spent in fully reading all those blobs to compare them. 
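To illustrate the idea, here is a minimal, hypothetical sketch of size-gated de-duplication. The class name, the threshold, and the use of plain {{String}} ids instead of Oak's {{RecordId}} are all assumptions for illustration; this is not the actual {{SegmentWriter}} code.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: de-duplicate only small blobs during compaction.
// Large blobs are never read for comparison, which is where the IO/CPU
// savings mentioned above would come from. Names and threshold are
// illustrative, not Oak's actual API.
public class SmallBlobDeduplicator {
    public static final int SMALL_BLOB_LIMIT = 16 * 1024; // assumed threshold

    private final Map<String, String> idsByHash = new HashMap<>();

    /**
     * Returns the previously recorded id for identical small content,
     * or records and returns the given id. Large blobs bypass the
     * de-duplication map entirely.
     */
    public String deduplicate(byte[] content, String recordId) {
        if (content.length > SMALL_BLOB_LIMIT) {
            return recordId; // skip de-duplication for large blobs
        }
        String hash = sha256(content);
        return idsByHash.computeIfAbsent(hash, k -> recordId);
    }

    private static String sha256(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Keying on a content hash rather than reading blobs pairwise keeps the comparison cost proportional to the small blobs only.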

* I'm also not sure we need the explicit {{RecordCache}} instance for node 
states. The {{WriterCacheManager}} of the segment writer already has a 
de-duplication cache for nodes ({{NodeCache}}). If that doesn't work for the 
offline compaction case, we should at least disable it so it doesn't consume 
extra memory for nothing. 
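Disabling a redundant cache could amount to swapping in a no-op implementation along these lines. This is a generic sketch of the pattern, not {{WriterCacheManager}}'s actual API; the class and method names are made up.

```java
import java.util.function.Function;

// Hypothetical no-op cache sketch: if a de-duplication cache brings no
// benefit for offline compaction, a cache that retains nothing avoids
// the memory overhead while keeping the call sites unchanged.
// Names are illustrative, not Oak's actual API.
public class NoopRecordCache<K, V> {

    /** Always a miss: computes the value but never retains it. */
    public V get(K key, Function<K, V> loader) {
        return loader.apply(key);
    }

    public void put(K key, V value) {
        // intentionally empty: nothing is cached
    }
}
```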

* Just noted that binaries are also put into that {{RecordCache}}. I think 
this is definitely not necessary, as those are already de-duplicated via the 
mechanism described in the first bullet. 

* Do we still need {{Compactor#setContentEqualityCheck}} and related?



> Rework offline compaction
> -------------------------
>
>                 Key: OAK-4279
>                 URL: https://issues.apache.org/jira/browse/OAK-4279
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: segment-tar
>            Reporter: Michael Dürig
>            Assignee: Alex Parvulescu
>            Priority: Blocker
>              Labels: compaction, gc
>             Fix For: 1.6
>
>         Attachments: OAK-4279-v0.patch, OAK-4279-v1.patch, OAK-4279-v2.patch, 
> OAK-4279-v3.patch, OAK-4279-v4.patch
>
>
> The fix for OAK-3348 broke some of the previous functionality of offline 
> compaction:
> * No more progress logging
> * Compaction is not interruptible any more (in the sense of OAK-3290)
> * Offline compaction could remove the ids of the segment node states to 
> squeeze out some extra space. Those are only needed for later generations 
> generated via online compaction. 
> We should probably implement offline compaction again through a dedicated 
> {{Compactor}} class as it was done in {{oak-segment}} instead of relying on 
> the de-duplication cache (aka online compaction). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
