[ https://issues.apache.org/jira/browse/OAK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312095#comment-15312095 ]

Alex Parvulescu commented on OAK-4279:
--------------------------------------

{quote}
Do we need to de-duplicate binaries through Map<String, List<RecordId>> 
binaries? After all there is some de-duplication already happening in 
SegmentWriter.SegmentWriteOperation#writeStream. That method tries to extract 
the list of ids of the bulks of a stream and just rewrites that list instead of 
all bulks. Obviously this doesn't help if the same blob is already persisted 
more than once into different bulks. However I wonder how often this actually 
happens in practice. At least I think this is very unlikely to happen for large 
blobs.
{quote}

I'm going to add some tests here. If I understand correctly, your concern is 
about trading extra IO for a more performant de-duplication mechanism.
I'd like to hide this heavy check behind a dedicated flag (off by default), so 
we still have the option to turn on full de-duplication of binaries if needed 
(even if it's for testing only).
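
For illustration, this is roughly what I have in mind; a sketch only, the 
system property name and the helper methods are made up, not what the actual 
patch would use:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.jackrabbit.oak.api.Blob;
import org.apache.jackrabbit.oak.segment.RecordId;

// Sketch only: flag-gated full de-duplication of binaries in the compactor.
// Property name and the abstract helpers below are hypothetical.
public abstract class BinaryDedupSketch {

    // heavy check stays off unless explicitly enabled
    private static final boolean BINARY_DEDUP =
            Boolean.getBoolean("oak.compaction.binaryDeduplication");

    // de-duplication key -> record ids of already compacted binaries sharing that key
    private final Map<String, List<RecordId>> binaries = new HashMap<>();

    RecordId compactBlob(Blob blob, String key) throws IOException {
        if (BINARY_DEDUP) {
            for (RecordId candidate : binaries.getOrDefault(key, Collections.<RecordId>emptyList())) {
                // the extra IO only happens when length + 4k header hash collide
                if (sameContent(blob, candidate)) {
                    return candidate;
                }
            }
        }
        RecordId compacted = writeBlob(blob); // regular write path
        if (BINARY_DEDUP) {
            binaries.computeIfAbsent(key, k -> new ArrayList<>()).add(compacted);
        }
        return compacted;
    }

    // full content comparison against an already compacted binary (the costly part)
    protected abstract boolean sameContent(Blob blob, RecordId compacted) throws IOException;

    // delegates the actual write to the segment writer
    protected abstract RecordId writeBlob(Blob blob) throws IOException;
}
{code}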

bq. So maybe we should do the extra de-duplication in the compactor only for 
small blobs, shaving off quite a bit of IO and CPU cycles now spent in fully 
reading all those blobs for comparing them.
Hmm, there is no de-duplication happening for small blobs in the compactor 
itself (other than the record cache), so I'm not sure what you mean here.
For the 'quite a bit of IO' part: the de-duplication key is 
{{blob.length() + ":" + Hashing.sha1().hashBytes(buffer, 0, n);}}. How many 
large blobs have the same length and the same 4k header (the buffer size is 4k)? 
Those are the only ones you'd end up fully comparing, so I'm not convinced this 
is such a huge gain, but as I mentioned I'm open to hiding this behind a flag, 
disabled by default.
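
To make the cost of computing that key concrete, here is a stand-alone 
approximation of the snippet above; the class name and the 4k buffer constant 
are mine, only the key format is taken from the comment:

{code:java}
import java.io.IOException;
import java.io.InputStream;

import com.google.common.hash.Hashing;

import org.apache.jackrabbit.oak.api.Blob;

// Approximation of the de-duplication key quoted above:
// blob.length() + ":" + Hashing.sha1().hashBytes(buffer, 0, n)
// Only the first 4k of the stream are hashed, so two blobs need a full content
// comparison only when both the length and the 4k header hash match.
final class BlobKey {

    private static final int HEADER_SIZE = 4096; // 4k header

    static String of(Blob blob) throws IOException {
        byte[] buffer = new byte[HEADER_SIZE];
        try (InputStream in = blob.getNewStream()) {
            int n = in.read(buffer, 0, buffer.length);
            return blob.length() + ":" + Hashing.sha1().hashBytes(buffer, 0, Math.max(n, 0));
        }
    }

    private BlobKey() {
    }
}
{code}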

{quote}
Just noted that binaries are also put into that RecordCache. I think this is 
definitely not necessary as those are de-duplicated via the mechanism described 
in the first paragraph.
{quote}
I don't fully get the concern. If we want to minimize IO, then it makes sense 
to cache the mapping between compacted binary records. What is the benefit of 
going from 'no IO' (direct mapping cached) to 'some IO' (extracting record ids 
from the stream)?
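
Schematically, this is the contrast I mean; the cache and writer types below 
are stand-ins for illustration, not the actual oak-segment-tar classes:

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.jackrabbit.oak.api.Blob;
import org.apache.jackrabbit.oak.segment.RecordId;

// Sketch only: contrasts the cached mapping ('no IO') with re-deriving the
// compacted id from the stream ('some IO').
abstract class CompactedBinaryCacheSketch {

    // old (pre-compaction) record id -> new (compacted) record id
    private final Map<RecordId, RecordId> compacted = new HashMap<>();

    RecordId compact(RecordId before, Blob blob) throws IOException {
        RecordId after = compacted.get(before);
        if (after != null) {
            return after;          // 'no IO': direct mapping, nothing is read
        }
        after = writeStream(blob); // 'some IO': the stream is consulted again,
                                   // even if only to rewrite the list of bulk ids
        compacted.put(before, after);
        return after;
    }

    // delegates to the segment writer's stream handling
    abstract RecordId writeStream(Blob blob) throws IOException;
}
{code}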

> Rework offline compaction
> -------------------------
>
>                 Key: OAK-4279
>                 URL: https://issues.apache.org/jira/browse/OAK-4279
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: segment-tar
>            Reporter: Michael Dürig
>            Assignee: Alex Parvulescu
>            Priority: Blocker
>              Labels: compaction, gc
>             Fix For: 1.6
>
>         Attachments: OAK-4279-checkpoints.patch, OAK-4279-v0.patch, 
> OAK-4279-v1.patch, OAK-4279-v2.patch, OAK-4279-v3.patch, OAK-4279-v4.patch
>
>
> The fix for OAK-3348 broke some of the previous functionality of offline 
> compaction:
> * No more progress logging
> * Compaction is not interruptible any more (in the sense of OAK-3290)
> * Offline compaction could remove the ids of the segment node states to 
> squeeze out some extra space. Those are only needed for later generations 
> generated via online compaction. 
> We should probably implement offline compaction again through a dedicated 
> {{Compactor}} class as it was done in {{oak-segment}} instead of relying on 
> the de-duplication cache (aka online compaction). 


