Tomek Rękawek created OAK-4751:
----------------------------------
Summary: Improve the checkpoint migration performance
Key: OAK-4751
URL: https://issues.apache.org/jira/browse/OAK-4751
Project: Jackrabbit Oak
Issue Type: Improvement
Components: segment-tar, upgrade
Reporter: Tomek Rękawek
(based on [~alex.parvulescu] input):
During the segment -> segment-tar migration, a fair amount of time is spent in
the deduplication process. The repository ingests large amounts of content (a
checkpoint is the equivalent of a full repository state), only to find, once it
deduplicates the data, that it is already available in the destination
repository.
This happens because the diff mechanism cannot be efficient across
repositories.
For example: on the source repository we have the root state r0 and a
checkpoint cp0 that is very close to r0. diff(r0, cp0) is extremely cheap,
measured in milliseconds. But the sidegrade first copies r0 to the destination
repository (r0 -> rx1) and then runs diff(rx1, cp0), which is very expensive:
the two node states don't originate from the same repository, so the diff falls
back to a slow content-equality comparison. And since the content is almost
equal, a huge number of cycles is wasted deduplicating data across the two
repositories.
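
To make the cost difference concrete, here is a toy Python model. It is not Oak
code: the Node class, the record ids and the copy_to and diff helpers are all
invented for illustration. It assumes that a store can treat two states with
the same record id as equal without reading them, while states coming from
different stores have to be compared by content.

```python
from itertools import count

_record_ids = count()  # hypothetical stand-in for per-store record ids


class Node:
    """Toy immutable-style node state; not the Oak NodeState API."""

    def __init__(self, store, props=None, children=None):
        self.store = store
        self.props = dict(props or {})
        self.children = dict(children or {})
        self.record_id = next(_record_ids)


def same_record(a, b):
    # Only states from the same store can be recognised as identical cheaply.
    return a.store == b.store and a.record_id == b.record_id


def copy_to(node, store):
    # Copying into another store creates new records for every node.
    kids = {name: copy_to(child, store) for name, child in node.children.items()}
    return Node(store, node.props, kids)


def diff(base, target, path="", stats=None):
    """Collect changed paths and count how many nodes had to be read."""
    if stats is None:
        stats = {"visited": 0, "changed": []}
    if same_record(base, target):
        return stats  # identical record: the whole subtree is skipped
    stats["visited"] += 1
    if base.props != target.props:
        stats["changed"].append(path or "/")
    for name in sorted(set(base.children) | set(target.children)):
        b, t = base.children.get(name), target.children.get(name)
        if b is None or t is None:
            stats["changed"].append(f"{path}/{name}")  # added or removed child
        else:
            diff(b, t, f"{path}/{name}", stats)
    return stats


# r0 and cp0 live in the same store and share all but one child.
leaves = {f"n{i}": Node("src", {"v": i}) for i in range(1000)}
r0 = Node("src", children=leaves)
cp_children = dict(leaves)
cp_children["n7"] = Node("src", {"v": -1})
cp0 = Node("src", children=cp_children)

rx1 = copy_to(r0, "dst")  # r0 as it appears after being copied to the destination

local = diff(r0, cp0)   # same store: shared subtrees are skipped
cross = diff(rx1, cp0)  # different stores: every node is compared by content

print(local["visited"], local["changed"])  # 2 ['/n7']
print(cross["visited"], cross["changed"])  # 1001 ['/n7']
```

Both diffs report the same single change, but the cross-store one has to read
the whole tree to find it.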
I have no easy solution here other than looking into a diff mechanism that
compares the two local states, diff(r0, cp0), but applies the delta to the
destination repository (on top of rx1). I'm not sure how easy this will turn
out to be, or whether it's worth the effort.
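
A rough sketch of that idea, again as a toy Python model rather than Oak code.
Trees are plain nested dicts, object identity ("is") stands in for "same record
in the same store", and local_delta and apply_delta are hypothetical helper
names. The delta is computed between the two source-side states only and then
replayed on top of the destination copy.

```python
import copy


def local_delta(base, target, path=()):
    """Diff two states from the same store; shared subtrees are skipped."""
    if base is target:  # same record: nothing below can differ
        return []
    ops = []
    for name in sorted(base.keys() | target.keys()):
        p = path + (name,)
        if name not in target:
            ops.append(("remove", p, None))
        elif name not in base:
            ops.append(("set", p, target[name]))
        else:
            b, t = base[name], target[name]
            if isinstance(b, dict) and isinstance(t, dict):
                ops.extend(local_delta(b, t, p))
            elif b != t:
                ops.append(("set", p, t))
    return ops


def apply_delta(dest_root, ops):
    """Replay the operations on a destination-side tree, in place."""
    for op, path, value in ops:
        node = dest_root
        for name in path[:-1]:
            node = node[name]
        if op == "remove":
            del node[path[-1]]
        else:
            node[path[-1]] = copy.deepcopy(value)


# Source store: cp0 shares every child with r0 except n7.
children = {f"n{i}": {"v": i} for i in range(1000)}
r0 = {"content": children}
cp_children = dict(children)
cp_children["n7"] = {"v": -1}
cp0 = {"content": cp_children}

rx1 = copy.deepcopy(r0)  # stands in for r0 already migrated to the destination

ops = local_delta(r0, cp0)       # cheap: computed on the source side only
cp_dest = copy.deepcopy(rx1)     # toy substitute for a builder on top of rx1
apply_delta(cp_dest, ops)

print(ops)  # [('set', ('content', 'n7', 'v'), -1)]
```

The destination is never compared against the source by content; it only
receives the one operation found by the local diff. The deep copy of rx1 is an
artifact of using mutable dicts here and says nothing about how a real store
would build the new state.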