Tomek Rękawek created OAK-4751:
----------------------------------

             Summary: Improve the checkpoint migration performance
                 Key: OAK-4751
                 URL: https://issues.apache.org/jira/browse/OAK-4751
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: segment-tar, upgrade
            Reporter: Tomek Rękawek


(based on [~alex.parvulescu] input):

During the segment -> segment-tar migration, a fair amount of time is taken by 
the deduplication process. The repository ingests large amounts of content (a 
checkpoint is the equivalent of a full repository state) and, once it 
deduplicates the data, finds that it is already available in the destination 
repository.

This happens because the diff mechanism cannot be efficient across 
repositories.

For example: on the source repo we have the root state r0 and a checkpoint cp0 
that is very close to r0. diff(r0, cp0) is extremely cheap, measured in 
milliseconds. But the sidegrade first copies r0 to the destination repository 
(r0 -> rx1) and then runs diff(rx1, cp0), which becomes very expensive: the two 
node states don't originate from the same repository, so the diff falls back to 
a slow content-equality comparison. Moreover, the content is almost equal, so a 
huge number of cycles is wasted deduplicating data across the two repositories.
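
To make the cost difference concrete, here is a toy model in Python. It is not 
Oak code: Node, store, record_id, make, copy_to and count_visits are all 
hypothetical names, and the "same store and same record id" shortcut is only an 
assumed stand-in for whatever identity check the real diff uses.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict

_ids = itertools.count(1)

@dataclass
class Node:
    store: str       # hypothetical: repository that holds this node
    record_id: int   # hypothetical: identity of the stored record
    value: str = ""
    children: Dict[str, "Node"] = field(default_factory=dict)

def make(store, value="", children=None):
    """Create a node with a fresh record id in the given store."""
    return Node(store, next(_ids), value, children or {})

def copy_to(node, store):
    """Deep-copy a subtree into another store; every copy gets a new record id."""
    return make(store, node.value,
                {name: copy_to(child, store) for name, child in node.children.items()})

def count_visits(a, b):
    """Count the node pairs a comparison has to look at."""
    # Same store and same record: the subtrees are known to be identical.
    if a.store == b.store and a.record_id == b.record_id:
        return 1
    # Otherwise there is no shortcut: descend into every common child.
    visits = 1
    for name in a.children.keys() & b.children.keys():
        visits += count_visits(a.children[name], b.children[name])
    return visits

# r0 and cp0 live in the same store and share one large, unchanged subtree.
big = make("src", children={f"n{i}": make("src") for i in range(1000)})
r0 = make("src", children={"content": big, "a": make("src", "old")})
cp0 = make("src", children={"content": big, "a": make("src", "new")})
rx1 = copy_to(r0, "dst")  # what the sidegrade does with r0

print(count_visits(r0, cp0))   # 3
print(count_visits(rx1, cp0))  # 1003
```

In this model the local comparison stops at the shared subtree, while the 
cross-store comparison has to walk all of it even though nothing in it changed.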

I have no easy solution here other than looking into providing a diff mechanism 
that compares the two local states, diff(r0, cp0), BUT applies the delta to the 
destination repository (on top of rx1). I'm not sure how easy this will turn 
out to be, or whether it's worth the effort.
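
A rough sketch of that idea, again as a toy Python model rather than Oak code. 
Trees are nested dicts, object identity stands in for "same record in the same 
repository", and diff, apply_delta and DELETED are hypothetical names. It 
assumes both roots are dicts.

```python
import copy

DELETED = object()  # hypothetical marker for a removed entry

def diff(before, after, path=()):
    """Yield (path, new_value) pairs; identical (shared) objects are skipped."""
    if before is after:
        return  # shared subtree, nothing to report
    if not (isinstance(before, dict) and isinstance(after, dict)):
        if before != after:
            yield path, after
        return
    for name in before.keys() | after.keys():
        if name not in after:
            yield path + (name,), DELETED
        elif name not in before:
            yield path + (name,), after[name]
        else:
            yield from diff(before[name], after[name], path + (name,))

def apply_delta(root, delta):
    """Replay the changes onto another tree (here: the destination copy)."""
    for path, value in delta:
        parent = root
        for name in path[:-1]:
            parent = parent[name]
        if value is DELETED:
            parent.pop(path[-1], None)
        else:
            parent[path[-1]] = copy.deepcopy(value)

shared = {f"n{i}": "v" for i in range(1000)}
r0 = {"content": shared, "a": "old", "gone": "x"}
cp0 = {"content": shared, "a": "new", "b": {"c": "added"}}

rx1 = copy.deepcopy(r0)        # r0 already copied to the destination
delta = list(diff(r0, cp0))    # cheap: computed on the two local states
apply_delta(rx1, delta)        # only the delta is written to the destination

print(len(delta), rx1 == cp0)  # 3 True
```

Only the three changed entries cross over to the destination; the large shared 
subtree is never compared or re-ingested.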



