[
https://issues.apache.org/jira/browse/OAK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Parvulescu updated OAK-1392:
---------------------------------
Attachment: OAK-1392-v0.patch
Attaching a WIP patch for review.
I've copied some bits from the NodeStoreKernel to compute the checksum on save.
As proposed, I added the length check to AbstractBlob. This patch no longer
relies on it, because I changed the SegmentBlob equals check a bit, but I think
it is a good check to have in general, so I left it in the patch.
I couldn't manage to implement #3. At save time I can see when the update
happens, but I can't figure out how to match old binaries with new binaries.
For example, the Lucene index saves its files split over multiple slices as a
binary list property. When an update comes along, I can't tell whether this list
is stable, or whether the items got shuffled and the indices changed while the
overall entry didn't.
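
For illustration only, computing a checksum while a binary is being written
could look roughly like the sketch below. It uses only java.util.zip from the
JDK and is not the code in the patch; the class name ChecksumOnSave and the
method copyWithChecksum are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper, not part of Oak: checksum a stream while copying it.
final class ChecksumOnSave {

    // Copies 'in' to 'out' and returns the Adler-32 checksum of the copied bytes.
    static long copyWithChecksum(InputStream in, OutputStream out) throws IOException {
        // CheckedInputStream updates the checksum as bytes are read through it.
        CheckedInputStream checked = new CheckedInputStream(in, new Adler32());
        byte[] buffer = new byte[4096];
        int n;
        while ((n = checked.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return checked.getChecksum().getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        long checksum = copyWithChecksum(new ByteArrayInputStream(data), stored);
        System.out.println(Long.toHexString(checksum)); // prints 11e60398
    }
}
```

The checksum comes for free with the copy, so no second pass over the binary
is needed at save time.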
> SegmentBlob.equals() optimization
> ---------------------------------
>
> Key: OAK-1392
> URL: https://issues.apache.org/jira/browse/OAK-1392
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: core
> Reporter: Jukka Zitting
> Attachments: OAK-1392-v0.patch
>
>
> The current {{SegmentBlob.equals()}} method only checks for reference
> equality before falling back to the {{AbstractBlob.equals()}} method that
> just scans the entire byte stream.
> This works well for the majority of cases where a binary won't change at all
> or at least not often. However, there are some cases where a client
> frequently updates a binary or even rewrites it with the exact same contents.
> We should also optimize the handling of those cases.
> Some ideas on different things we can/should do:
> # Make {{AbstractBlob.equals()}} compare the blob lengths before scanning the
> byte streams. If a blob has changed, its length is likely also different, in
> which case the length check should provide a quick shortcut.
> # Keep a simple checksum like Adler-32 along with medium-sized value records
> and the block record references of a large value record. Compare those
> checksums before falling back to a full byte scan. This should capture
> practically all cases where the binaries are different even with equal
> lengths, but still not the case where they're equal.
> # When updating a binary value, do an equality check with the previous value
> and reuse the previous value if equal. The extra cost of doing this should
> already be recovered when the commit hooks that look at the change no longer
> have to consider an unchanged binary.
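
As a rough sketch of how the three ideas above could fit together, the example
below orders the checks from cheapest to most expensive and reuses the previous
value when the contents turn out to be equal. It is an assumption-laden
illustration, not Oak code: SimpleBlob is a hypothetical in-memory stand-in for
a blob, and sameContent and reuseIfEqual are names invented here.

```java
import java.util.Arrays;
import java.util.zip.Adler32;

final class BlobEqualitySketch {

    // Hypothetical in-memory stand-in for a blob; not an Oak class.
    static final class SimpleBlob {
        final byte[] data;
        final long checksum; // Adler-32, computed once when the value is created

        SimpleBlob(byte[] data) {
            this.data = data.clone();
            Adler32 adler = new Adler32();
            adler.update(this.data, 0, this.data.length);
            this.checksum = adler.getValue();
        }

        long length() {
            return data.length;
        }
    }

    // Cheap checks first; the full byte scan runs only if they all pass.
    static boolean sameContent(SimpleBlob a, SimpleBlob b) {
        if (a == b) {
            return true;                      // reference equality
        }
        if (a.length() != b.length()) {
            return false;                     // idea 1: length check
        }
        if (a.checksum != b.checksum) {
            return false;                     // idea 2: checksum check
        }
        return Arrays.equals(a.data, b.data); // fallback: full byte scan
    }

    // Idea 3: keep the previous value if the update did not change the content.
    static SimpleBlob reuseIfEqual(SimpleBlob previous, SimpleBlob updated) {
        if (previous != null && sameContent(previous, updated)) {
            return previous;
        }
        return updated;
    }

    public static void main(String[] args) {
        SimpleBlob old = new SimpleBlob(new byte[] {1, 2, 3});
        SimpleBlob rewritten = new SimpleBlob(new byte[] {1, 2, 3});
        SimpleBlob changed = new SimpleBlob(new byte[] {1, 2, 4});
        System.out.println(reuseIfEqual(old, rewritten) == old); // true
        System.out.println(reuseIfEqual(old, changed) == old);   // false
    }
}
```

Note that a matching checksum never proves equality, so the equal case still
pays for the full scan; only the unequal cases get the shortcut.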