[ 
https://issues.apache.org/jira/browse/OAK-11991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035037#comment-18035037
 ] 

Ieran Draghiciu edited comment on OAK-11991 at 11/3/25 12:40 PM:
-----------------------------------------------------------------

With [PR|https://github.com/apache/jackrabbit-oak/pull/2604] I introduced 
copying blobs in batches (1000 blobs/batch):
- split the blobs into batches of 1000 each; any remaining blobs are handled 
in a separate final batch
- start the copy for each batch
- then check the copy status of each blob. The advantage is that by the time 
we check the first blob it should already be copied, so we don't have to wait 
for each blob individually.
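The steps above can be sketched roughly as follows. This is a simplified, hypothetical illustration of the batching pattern, not the actual PR code: {{startCopyAsync}} stands in for the server-side begin-copy call, and {{BatchCopySketch}}/{{partition}} are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class BatchCopySketch {
    static final int BATCH_SIZE = 1000;

    // Split the blob list into batches of BATCH_SIZE; the remainder
    // ends up in a separate final batch.
    static List<List<String>> partition(List<String> blobs) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < blobs.size(); i += BATCH_SIZE) {
            batches.add(blobs.subList(i, Math.min(i + BATCH_SIZE, blobs.size())));
        }
        return batches;
    }

    // Hypothetical stand-in for starting a server-side blob copy.
    static CompletableFuture<Void> startCopyAsync(String blob) {
        return CompletableFuture.completedFuture(null);
    }

    public static void main(String[] args) {
        List<String> blobs = new ArrayList<>();
        for (int i = 0; i < 3717; i++) blobs.add("segment-" + i);

        for (List<String> batch : partition(blobs)) {
            // 1. Fire off the copy for every blob in the batch...
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (String blob : batch) pending.add(startCopyAsync(blob));
            // 2. ...then check statuses in order. By the time the first
            // blob is polled, its copy has usually already completed.
            for (CompletableFuture<Void> copy : pending) copy.join();
        }
        System.out.println(partition(blobs).size() + " batches");
    }
}
```

With 3717 blobs this yields 4 batches: three full batches of 1000 and a final batch of 717.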

Also, the batch approach is faster because of how Azure is optimized:
- Azure Storage is heavily optimized for concurrent PUTs (write-heavy 
patterns); server-side copies go through Azure's internal copy pipeline
- the service can queue and parallelize these internal copy operations 
efficiently
- but when you interleave PUTs with HEADs (reads), Azure's load balancers and 
caching layers can't optimize as effectively, because you force 
read-after-write consistency checks repeatedly

Azure is more efficient when doing "pure" PUT operations (writes) or pure HEAD 
operations (reads) in bulk, rather than interleaving them.

I tested on an environment with 3717 blobs and the copy completed in ~1min 2sec.

{code}
(recovery process starts)
03.11.2025 10:13:58.251 *WARN* [segmentstore-init-6] 
org.apache.jackrabbit.oak.segment.file.tar.TarReader Could not find a valid tar 
index in [data00010a.tar], recovering...
03.11.2025 10:13:58.252 *INFO* [segmentstore-init-6] 
org.apache.jackrabbit.oak.segment.file.tar.TarReader Recovering segments from 
tar file data00010a.tar

(lists all blobs)
...
03.11.2025 10:13:58.822 *INFO* [reactor-http-nio-2] 
org.apache.jackrabbit.oak.segment.azure.AzureHttpRequestLoggingPolicy HTTP 
Request: GET 
https://sa01394020shared0925dfef.blob.core.windows.net/aem-sgmt-fbe8b722af60268e1d58106abf6f4a4522c5d382-000004/aem%2Fdata00010a.tar%2F0000.78fabc53-2ced-4f0a-a7b1-d86b36bd9aee
 200 9ms
....
03.11.2025 10:14:20.179 *INFO* [reactor-http-nio-3] 
org.apache.jackrabbit.oak.segment.azure.AzureHttpRequestLoggingPolicy HTTP 
Request: GET 
https://sa01394020shared0925dfef.blob.core.windows.net/aem-sgmt-fbe8b722af60268e1d58106abf6f4a4522c5d382-000004/aem%2Fdata00010a.tar%2F0e84.ef629d76-d7c2-4d6a-a1ee-0f483c7c5256
 200 4ms
03.11.2025 10:14:20.181 *INFO* [segmentstore-init-6] 
org.apache.jackrabbit.oak.segment.azure.AzureArchiveManager Recovering segment 
data00010a.tar/0000.78fabc53-2ced-4f0a-a7b1-d86b36bd9aee
...

(recover blobs)
03.11.2025 10:14:20.263 *INFO* [segmentstore-init-6] 
org.apache.jackrabbit.oak.segment.azure.AzureArchiveManager Recovering segment 
data00010a.tar/0e84.ef629d76-d7c2-4d6a-a1ee-0f483c7c5256
...

(copy blobs to bak)
03.11.2025 10:14:23.804 *INFO* [segmentstore-init-6] 
org.apache.jackrabbit.oak.segment.azure.AzureArchiveManager Start copy 3717 
blobs to aem/data00010a.tar.29.bak/
03.11.2025 10:14:23.829 *INFO* [reactor-http-nio-1] 
org.apache.jackrabbit.oak.segment.azure.AzureHttpRequestLoggingPolicy HTTP 
Request: PUT 
https://sa01394020shared0925dfef.blob.core.windows.net/aem-sgmt-fbe8b722af60268e1d58106abf6f4a4522c5d382-000004/aem%2Fdata00010a.tar.29.bak%2F0000.78fabc53-2ced-4f0a-a7b1-d86b36bd9aee
 202 11ms
....
03.11.2025 10:15:25.500 *INFO* [reactor-http-nio-3] 
org.apache.jackrabbit.oak.segment.azure.AzureHttpRequestLoggingPolicy HTTP 
Request: HEAD 
https://sa01394020shared0925dfef.blob.core.windows.net/aem-sgmt-fbe8b722af60268e1d58106abf6f4a4522c5d382-000004/aem%2Fdata00010a.tar.29.bak%2F0e84.ef629d76-d7c2-4d6a-a1ee-0f483c7c5256
 200 3ms
...
1 min for 3717 blobs
{code}





> Optimize the oak-segment recovery process
> -----------------------------------------
>
>                 Key: OAK-11991
>                 URL: https://issues.apache.org/jira/browse/OAK-11991
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: segment-azure, segment-tar
>            Reporter: Ieran Draghiciu
>            Priority: Major
>
> Tar archives with many segment files (more than 10,000) take too long to 
> recover. Investigate and implement a solution to optimize this process.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
