[jira] [Comment Edited] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups

Chetan Mehrotra (JIRA) Sun, 17 Dec 2017 22:45:19 -0800

    [ https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294556#comment-16294556 ]


Chetan Mehrotra edited comment on OAK-6353 at 12/18/17 6:44 AM:
----------------------------------------------------------------

With the new document order traversal based indexing, significant performance 
improvements were seen. 

For a large repo (255M Mongo documents, 66M nodes under /content, 4.2M 
assets), indexing earlier completed in 13.66 h. With document order based 
indexing it completed in 3.469 h, i.e. roughly 3.9 times faster. 

With this, the initially planned implementation is done. Specific issues can 
be opened later for further improvements. Possible future enhancements:

# Prefetch the previous documents before doing the Mongo traversal - This may 
reduce the time needed to resolve a NodeDocument to a NodeState
# Mongo query optimizations
## Avoid fetching nodes under hidden paths at all
## Only fetch those documents from Mongo which are under included paths - This 
can be done by using a JavaScript function
# Sorting optimization - Sort each batch in memory as nodes are being read and 
just write out the sorted files
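
To illustrate the sorting optimization above, here is a minimal sketch in Python (not the actual Oak code, which is Java). It assumes each entry is a (path, state) pair with a JSON-serializable state. All function names and the one-JSON-array-per-line file layout are hypothetical. Each batch is sorted in memory while reading and written out as an already sorted file, so the later step only has to merge the files.

```python
import heapq
import json
import os
import tempfile


def path_key(path):
    # Compare paths element by element so that a parent sorts directly before
    # its children ("/a", "/a/b", "/a-b" rather than "/a", "/a-b", "/a/b").
    return tuple(path.split("/"))


def write_sorted_batches(entries, batch_size, work_dir):
    """Read entries in arrival (document) order, sort each batch in memory
    and write it out as one sorted file. Returns the list of file names."""
    files, batch = [], []
    for entry in entries:
        batch.append(entry)
        if len(batch) >= batch_size:
            files.append(_write_batch(batch, work_dir, len(files)))
            batch = []
    if batch:
        files.append(_write_batch(batch, work_dir, len(files)))
    return files


def _write_batch(batch, work_dir, index):
    # Hypothetical on-disk layout: one JSON array [path, state] per line.
    batch.sort(key=lambda entry: path_key(entry[0]))
    name = os.path.join(work_dir, "sorted-%d.json" % index)
    with open(name, "w", encoding="utf-8") as out:
        for path, state in batch:
            out.write(json.dumps([path, state]) + "\n")
    return name


def _read_entries(name):
    with open(name, encoding="utf-8") as source:
        for line in source:
            path, state = json.loads(line)
            yield path, state


def merge_sorted_files(files):
    """Lazily merge the sorted files into one stream in path order."""
    streams = [_read_entries(name) for name in files]
    return heapq.merge(*streams, key=lambda entry: path_key(entry[0]))


if __name__ == "__main__":
    entries = [("/b", {}), ("/a/b", {}), ("/a-b", {}), ("/a", {})]
    with tempfile.TemporaryDirectory() as work_dir:
        files = write_sorted_batches(entries, 2, work_dir)
        for path, state in merge_sorted_files(files):
            print(path)
```

The batch size bounds memory use. The merge keeps only one pending entry per file in memory, so the unsorted data never needs a separate full sort pass.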

*Usage*

This mode can be enabled for Mongo based setups via the CLI argument 
{{--doc-traversal-mode}}

This indexing mode requires quite a bit of local disk space to store all the 
NodeStates in JSON format. For a 200GB Mongo repo it required 100GB of local 
disk space to hold the NodeState JSON and to perform the external sort on it.
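
As a rough aid for planning, the following Python sketch checks whether a working directory has enough free space before starting. The 0.5 ratio is only an assumption derived from the single data point above (100GB for a 200GB repo) and may differ for other repositories. The function names are hypothetical and not part of oak-run.

```python
import shutil

# Assumption: local disk need as a fraction of the Mongo repo size, taken
# from the one observation in this comment (100GB needed for a 200GB repo).
OBSERVED_RATIO = 100 / 200


def estimated_work_space_bytes(repo_size_bytes, ratio=OBSERVED_RATIO):
    """Estimate the local disk space needed for the NodeState JSON and sort."""
    return int(repo_size_bytes * ratio)


def has_enough_space(work_dir, repo_size_bytes, ratio=OBSERVED_RATIO):
    """Return True if work_dir currently has at least the estimated free space."""
    free = shutil.disk_usage(work_dir).free
    return free >= estimated_work_space_bytes(repo_size_bytes, ratio)
```

For example, {{has_enough_space("/tmp", 200 * 10**9)}} reports whether /tmp could hold the estimated 100GB for a 200GB repo.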

Also, the documentation needs to be updated.


> Use Document order traversal for reindexing performed on DocumentNodeStore 
> setups
> ---------------------------------------------------------------------------------
>
>                 Key: OAK-6353
>                 URL: https://issues.apache.org/jira/browse/OAK-6353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.7.13, 1.8
>
>         Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested 
> [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
>  that document order traversal can be faster than the current mode of path 
> based traversal. Initial tests indicate that such a traversal can be an order 
> of magnitude faster. 
> So this task is meant to implement such an approach and see if it can be a 
> viable indexing mode for DocumentNodeStore based setups



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
