[ 
https://issues.apache.org/jira/browse/OAK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898021#comment-13898021
 ] 

Alex Parvulescu commented on OAK-1392:
--------------------------------------

I think it looks good.

What I find a bit confusing is that I no longer see any hashing, neither for 
full binaries nor for block records.
Also, is the length check out? That seems like a good idea to keep around.

I had completely missed the block record comparison, which is a nice trick. 
So in fact the ByteStreams.equal() call already works with chunks of 4k, and 
this is already optimized at the segment level in the case of a large binary (>16k).
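For reference, a minimal sketch of the kind of chunked stream comparison discussed above (the class and method names are illustrative, not the actual Oak or Guava implementation; Oak delegates to Guava's ByteStreams.equal()):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ChunkedEquals {
    private static final int CHUNK_SIZE = 4096; // 4k chunks, as noted above

    // Compare two streams chunk by chunk; bails out on the first mismatch
    // instead of reading both streams to the end.
    static boolean contentEquals(InputStream a, InputStream b) throws IOException {
        byte[] bufA = new byte[CHUNK_SIZE];
        byte[] bufB = new byte[CHUNK_SIZE];
        while (true) {
            int readA = readFully(a, bufA);
            int readB = readFully(b, bufB);
            if (readA != readB || !Arrays.equals(
                    Arrays.copyOf(bufA, readA), Arrays.copyOf(bufB, readB))) {
                return false;
            }
            if (readA < CHUNK_SIZE) {
                return true; // both streams exhausted at the same point
            }
        }
    }

    // Fill the buffer as far as possible; returns the bytes read (0 on EOF).
    private static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                break;
            }
            off += n;
        }
        return off;
    }

    public static void main(String[] args) throws IOException {
        byte[] x = new byte[20000];
        byte[] y = x.clone();
        System.out.println(contentEquals(
                new ByteArrayInputStream(x), new ByteArrayInputStream(y))); // true
        y[19999] = 1;
        System.out.println(contentEquals(
                new ByteArrayInputStream(x), new ByteArrayInputStream(y))); // false
    }
}
```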

bq. but it would be nice if we didn't introduce any functionality that would 
break our ability to do highly efficient in-place updates of binary values
agreed


> SegmentBlob.equals() optimization
> ---------------------------------
>
>                 Key: OAK-1392
>                 URL: https://issues.apache.org/jira/browse/OAK-1392
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core
>            Reporter: Jukka Zitting
>         Attachments: 0001-OAK-1392-SegmentBlob.equals-optimization.patch, 
> OAK-1392-v0.patch
>
>
> The current {{SegmentBlob.equals()}} method only checks for reference 
> equality before falling back to the {{AbstractBlob.equals()}} method that 
> just scans the entire byte stream.
> This works well for the majority of cases, where a binary won't change at all 
> or at least not often. However, there are some cases where a client frequently 
> updates a binary or even rewrites it with the exact same contents. We should 
> optimize the handling of those cases as well.
> Some ideas on different things we can/should do:
> # Make {{AbstractBlob.equals()}} compare the blob lengths before scanning the 
> byte streams. If a blob has changed, its length is likely also different, in 
> which case the length check provides a quick shortcut.
> # Keep a simple checksum like Adler-32 along with medium-sized value records 
> and the block record references of a large value record. Compare those 
> checksums before falling back to a full byte scan. This should capture 
> practically all cases where the binaries are different even with equal 
> lengths, but still not the case where they're equal.
> # When updating a binary value, do an equality check against the previous 
> value and reuse the previous value if equal. The extra cost of doing this 
> should already be recovered when the commit hooks that look at the change no 
> longer have to consider an unchanged binary.
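A minimal sketch of how ideas 1 and 2 above could compose: reference equality first, then the length check, then an Adler-32 comparison (using java.util.zip.Adler32; here the checksum is computed eagerly, whereas the proposal would store it alongside the value records), falling back to a full byte scan only when everything else matches. The Blob class and blobEquals method are illustrative, not actual Oak code:

```java
import java.util.Arrays;
import java.util.zip.Adler32;

public class BlobEqualsSketch {

    // Hypothetical blob abstraction: length and checksum assumed to be
    // available cheaply, as they would be if stored with the value record.
    static final class Blob {
        final byte[] data;   // stand-in for the actual byte stream
        final long checksum; // Adler-32 kept alongside the record

        Blob(byte[] data) {
            this.data = data;
            Adler32 adler = new Adler32();
            adler.update(data, 0, data.length);
            this.checksum = adler.getValue();
        }
    }

    static boolean blobEquals(Blob a, Blob b) {
        if (a == b) {
            return true;                      // reference equality shortcut
        }
        if (a.data.length != b.data.length) {
            return false;                     // idea 1: cheap length check
        }
        if (a.checksum != b.checksum) {
            return false;                     // idea 2: checksum shortcut
        }
        return Arrays.equals(a.data, b.data); // last resort: full byte scan
    }

    public static void main(String[] args) {
        Blob a = new Blob(new byte[]{1, 2, 3});
        System.out.println(blobEquals(a, new Blob(new byte[]{1, 2, 3}))); // true
        System.out.println(blobEquals(a, new Blob(new byte[]{3, 2, 1}))); // false
    }
}
```

Note that equal-length, different-content blobs are almost always caught by the checksum, so the full scan runs essentially only when the binaries really are equal.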



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
