[
https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294556#comment-16294556
]
Chetan Mehrotra edited comment on OAK-6353 at 12/18/17 6:44 AM:
----------------------------------------------------------------
With the new document order traversal based indexing, significant performance
improvements were seen.
For a large repo (255M Mongo docs, 66M nodes under /content, including 4.2M
assets), indexing previously completed in 13.66 h. With document order
traversal it completed in 3.469 h, roughly 3.9 times faster.
With this, the initially planned implementation is done. Specific issues can be
opened later for further improvements. Possible future enhancements:
# Prefetch the previous documents before doing Mongo traversal - This may
reduce the time to resolve the NodeDocument to NodeState
# Mongo query optimizations
## Avoid fetching nodes under hidden paths at all
## Only fetch those documents from Mongo which are under included paths - This
can be done by using a JavaScript function
# Sorting optimization - Sort the batch in memory as nodes are being read and
just write the sorted files
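The sorting optimization in the last item could look roughly like the sketch
below. It is only an illustration, not Oak code: the class name
BatchSortSketch, its methods, the one-entry-per-line file layout and the plain
String ordering are all assumptions made for the example. Each batch is sorted
in memory as entries are read, every file written is already sorted, and the
files are then combined with a k-way merge.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch (not Oak code): sort batches in memory, then merge the sorted files. */
public class BatchSortSketch {

    /** One open sorted file plus the line currently at its head. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    /** Sorts each batch in memory while reading and writes it out as an already sorted file. */
    static List<Path> writeSortedRuns(Iterator<String> entries, int batchSize, Path dir) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> batch = new ArrayList<>(batchSize);
        while (entries.hasNext()) {
            batch.add(entries.next()); // assumption: entries contain no line breaks
            if (batch.size() == batchSize || !entries.hasNext()) {
                Collections.sort(batch); // plain String order stands in for the real ordering
                Path run = Files.createTempFile(dir, "run-", ".txt");
                Files.write(run, batch, StandardCharsets.UTF_8);
                runs.add(run);
                batch.clear();
            }
        }
        return runs;
    }

    /** K-way merge: repeatedly take the smallest head line among all sorted files. */
    static List<String> mergeRuns(List<Path> runs) throws IOException {
        PriorityQueue<Run> queue = new PriorityQueue<>();
        for (Path file : runs) {
            Run run = new Run(file);
            if (run.head != null) queue.add(run); else run.reader.close();
        }
        // A real implementation would stream to an output file instead of a list.
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Run run = queue.poll();
            merged.add(run.head);
            run.head = run.reader.readLine();
            if (run.head != null) queue.add(run); else run.reader.close();
        }
        return merged;
    }

    /** Convenience entry point: batch-sort into temp files, merge them, then clean up. */
    static List<String> sortInBatches(List<String> entries, int batchSize) {
        try {
            Path dir = Files.createTempDirectory("batch-sort");
            List<Path> runs = writeSortedRuns(entries.iterator(), batchSize, dir);
            List<String> merged = mergeRuns(runs);
            for (Path run : runs) Files.deleteIfExists(run);
            Files.deleteIfExists(dir);
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sortInBatches(
                Arrays.asList("/content/b", "/content/a", "/content/a/x", "/apps"), 2));
    }
}
```

Because every file is sorted before it is written, no separate per-file sort
pass is needed later; only the merge remains.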
*Usage*
This mode can be enabled for Mongo based setups via the CLI argument
{{--doc-traversal-mode}}.
This indexing mode requires a significant amount of local disk space to store
all the NodeStates in JSON format. For a 200GB Mongo repo, it required 100GB of
local disk space to hold the NodeState JSON and to perform the external sort on
it.
Also, the documentation needs to be updated.
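As a rough sizing aid for the disk space figures above, the single observed
data point (200GB repo, 100GB local disk) can be scaled linearly. This is only
a back-of-the-envelope sketch: the class DiskSpaceEstimate and its method are
hypothetical, and the assumption that the ratio holds for other repositories
is not confirmed by the measurement.

```java
/** Hypothetical helper: scales the one observation from this comment (200GB repo, 100GB local disk). */
public class DiskSpaceEstimate {
    static final double OBSERVED_REPO_GB = 200.0;
    static final double OBSERVED_LOCAL_GB = 100.0;

    /** Assumes local disk usage grows linearly with the Mongo repo size. */
    static double estimateLocalDiskGb(double repoSizeGb) {
        return repoSizeGb * (OBSERVED_LOCAL_GB / OBSERVED_REPO_GB);
    }

    public static void main(String[] args) {
        // Example: estimate for a 500GB repo under the linear assumption.
        System.out.printf("%.0f GB%n", estimateLocalDiskGb(500.0));
    }
}
```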
> Use Document order traversal for reindexing performed on DocumentNodeStore
> setups
> ---------------------------------------------------------------------------------
>
> Key: OAK-6353
> URL: https://issues.apache.org/jira/browse/OAK-6353
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: run
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Fix For: 1.7.13, 1.8
>
> Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested
> [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
> that document order traversal can be faster compared to the current mode of
> path based traversal. Initial tests indicate that such a traversal can be an
> order of magnitude faster.
> So this task is meant to implement such an approach and see if it can be a
> viable indexing mode for DocumentNodeStore based setups.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)