[ 
https://issues.apache.org/jira/browse/OAK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312243#comment-15312243
 ] 

Alex Parvulescu commented on OAK-4279:
--------------------------------------

To clarify the 'de-duplication' concept a bit, I've attached more tests 
([^OAK-4279-binaries.patch]).
I think there are two sides here:
* de-duplication of blobs by reference: two or more nodes point to the same 
blob (covered by {{offlineCompactionBin2}})
The {{SegmentWriter}} does indeed cover this aspect nicely. It is still up for 
discussion whether putting the binary record ids in the cache makes sense (we 
need some numbers here on the IO involved in extracting the record ids from 
the stream itself).

* de-duplication of blobs by content: IO intensive and the main subject of the 
debate (covered by {{offlineCompactionBin1}})
This covers the {{binaries}} map. A fair amount of IO is involved, so it makes 
sense to put it behind a flag, disabled by default.
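
As a rough illustration of the two cases, here is a minimal sketch. It is not 
the actual {{SegmentWriter}} or {{Compactor}} code: {{BlobDedupSketch}}, 
{{Blob}}, {{write}} and the {{dedupByContent}} flag are hypothetical names, 
record ids are plain ints, and SHA-256 is only an assumed choice of content key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch of the two de-duplication modes; not Oak's real API.
class BlobDedupSketch {

    // Hypothetical stand-in for a binary value in the source repository.
    static final class Blob {
        final byte[] data;
        Blob(byte[] data) { this.data = data; }
    }

    // Content de-duplication sits behind a flag because it reads every blob.
    private final boolean dedupByContent;
    // By reference: the very same blob object was already written.
    private final Map<Blob, Integer> byReference = new IdentityHashMap<>();
    // By content: a different blob with identical bytes was already written.
    private final Map<String, Integer> byContent = new HashMap<>();
    private int nextRecordId = 0;

    BlobDedupSketch() { this(false); } // content de-duplication off by default

    BlobDedupSketch(boolean dedupByContent) { this.dedupByContent = dedupByContent; }

    // Returns the (fake) record id under which the blob ends up being stored.
    int write(Blob blob) {
        Integer id = byReference.get(blob);
        if (id != null) {
            return id; // cheap path, no blob content is read
        }
        String key = null;
        if (dedupByContent) {
            key = sha256(blob.data); // expensive path: reads the whole blob
            id = byContent.get(key);
        }
        if (id == null) {
            id = nextRecordId++; // pretend the blob is written here
            if (key != null) {
                byContent.put(key, id);
            }
        }
        byReference.put(blob, id);
        return id;
    }

    private static String sha256(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Blob a = new Blob("same bytes".getBytes(StandardCharsets.UTF_8));
        Blob b = new Blob("same bytes".getBytes(StandardCharsets.UTF_8));
        BlobDedupSketch refOnly = new BlobDedupSketch();
        System.out.println("by reference only, a vs a: " + (refOnly.write(a) == refOnly.write(a)));
        System.out.println("by reference only, a vs b: " + (refOnly.write(a) == refOnly.write(b)));
        BlobDedupSketch withContent = new BlobDedupSketch(true);
        System.out.println("with content flag, a vs b: " + (withContent.write(a) == withContent.write(b)));
    }
}
```

With the flag off, only the first kind of duplicate is collapsed; two distinct 
blobs with identical bytes are still stored twice. A real implementation would 
presumably stream the blob instead of holding it in memory, which is where the 
IO cost comes from.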

[~mduerig] do you think there are other cases we need to consider?

> Rework offline compaction
> -------------------------
>
>                 Key: OAK-4279
>                 URL: https://issues.apache.org/jira/browse/OAK-4279
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: segment-tar
>            Reporter: Michael Dürig
>            Assignee: Alex Parvulescu
>            Priority: Blocker
>              Labels: compaction, gc
>             Fix For: 1.6
>
>         Attachments: OAK-4279-binaries.patch, OAK-4279-checkpoints.patch, 
> OAK-4279-v0.patch, OAK-4279-v1.patch, OAK-4279-v2.patch, OAK-4279-v3.patch, 
> OAK-4279-v4.patch
>
>
> The fix for OAK-3348 broke some of the previous functionality of offline 
> compaction:
> * No more progress logging
> * Compaction is not interruptible any more (in the sense of OAK-3290)
> * Offline compaction could remove the ids of the segment node states to 
> squeeze out some extra space. Those are only needed for later generations 
> generated via online compaction. 
> We should probably implement offline compaction again through a dedicated 
> {{Compactor}} class as it was done in {{oak-segment}} instead of relying on 
> the de-duplication cache (aka online compaction). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
