[jira] [Commented] (OAK-1392) SegmentBlob.equals() optimization

Jukka Zitting (JIRA) Mon, 17 Feb 2014 08:18:45 -0800

    [ 
https://issues.apache.org/jira/browse/OAK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903349#comment-13903349
 ]


Jukka Zitting commented on OAK-1392:
------------------------------------

bq. if it makes sense to cache just the last one that was loaded

It might, but then again it might not.

My thinking here is that the bigger binaries (>16kB) that get stored in bulk 
segments are typically accessed much less frequently than other content. And 
the accesses that do occur are normally sequential in nature, so streaming them 
directly should be reasonably efficient. Also, caching just one segment would 
not help much: given the size and nature of binaries, that same segment is 
unlikely to be reused before some other bulk segment (for example the next 
segment in a binary larger than 256kB) is accessed.

A scan-resistant cache of bulk segments might work, but I'd only consider one 
if there's a real-world benchmark that shows that the benefit is worth the 
extra complexity.

> SegmentBlob.equals() optimization
> ---------------------------------
>
>                 Key: OAK-1392
>                 URL: https://issues.apache.org/jira/browse/OAK-1392
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core
>            Reporter: Jukka Zitting
>         Attachments: 0001-OAK-1392-SegmentBlob.equals-optimization.patch, 
> OAK-1392-v0.patch
>
>
> The current {{SegmentBlob.equals()}} method only checks for reference 
> equality before falling back to the {{AbstractBlob.equals()}} method that 
> just scans the entire byte stream.
> This works well for the majority of cases where a binary won't change at all 
> or at least not often. However, there are some cases where a client 
> frequently updates a binary or even rewrites it with the exact same contents. 
> We should also optimize the handling of those cases.
> Some ideas on different things we can/should do:
> # Make {{AbstractBlob.equals()}} compare the blob lengths before scanning the 
> byte streams. If a blob has changed, its length is likely also different, in 
> which case the length check should provide a quick shortcut.
> # Keep a simple checksum like Adler-32 along with medium-sized value records 
> and the block record references of a large value record. Compare those 
> checksums before falling back to a full byte scan. This should capture 
> practically all cases where the binaries are different even with equal 
> lengths, but still not the case where they're equal.
> # When updating a binary value, do an equality check with the previous value 
> and reuse the previous value if equal. The extra cost of doing this should 
> get recovered already when the commit hooks that look at the change won't 
> have to consider an unchanged binary.
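
The three ideas listed in the issue description could be combined roughly as in the sketch below. This is only an illustration under stated assumptions, not the attached patch and not the Oak API: SketchBlob, contentEquals() and reuseIfEqual() are hypothetical names, the blob is held in a plain byte array, and the checksum is computed eagerly instead of being stored alongside value records. Only java.util.zip.Adler32 is an actual JDK class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.Adler32;

/** Hypothetical stand-in for a blob; this is NOT the Oak Blob interface. */
final class SketchBlob {

    private final byte[] data;
    private final long checksum;

    SketchBlob(byte[] data) {
        this.data = data.clone();
        // Assumption: the checksum is computed once, when the value is created.
        Adler32 adler = new Adler32();
        adler.update(this.data, 0, this.data.length);
        this.checksum = adler.getValue();
    }

    long length() {
        return data.length;
    }

    long checksum() {
        return checksum;
    }

    InputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    /** Cheap checks first, full byte scan only as the last resort. */
    static boolean contentEquals(SketchBlob a, SketchBlob b) {
        if (a == b) {
            return true;                 // reference equality
        }
        if (a.length() != b.length()) {
            return false;                // idea 1: length shortcut
        }
        if (a.checksum() != b.checksum()) {
            return false;                // idea 2: checksum shortcut
        }
        // Equal length and checksum do not prove equality, so scan the bytes.
        // Reading one byte at a time keeps the sketch short; real code would
        // compare buffers.
        try (InputStream x = a.newStream(); InputStream y = b.newStream()) {
            int p;
            do {
                p = x.read();
                if (p != y.read()) {
                    return false;
                }
            } while (p != -1);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Idea 3: keep the previous value when the update has the same content. */
    static SketchBlob reuseIfEqual(SketchBlob previous, SketchBlob updated) {
        return contentEquals(previous, updated) ? previous : updated;
    }
}
```

Note that the length and checksum checks only speed up the "different" case. When the two binaries are actually equal, the full scan still runs, which matches the caveat in the second idea.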



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
