[ https://issues.apache.org/jira/browse/OAK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388524#comment-17388524 ]
Amrit Verma commented on OAK-9434:
----------------------------------
Configurations added -
*Sort strategy type* -
[https://github.com/apache/jackrabbit-oak/blob/1621b9d56434ee4a6f2cd19863f94d963d68ac91/oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileNodeStoreBuilder.java#L53]
Example test -
[https://github.com/apache/jackrabbit-oak/blob/1621b9d56434ee4a6f2cd19863f94d963d68ac91/oak-run-commons/src/test/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileStoreTest.java#L102]
*Thread pool size for parallel download* -
[https://github.com/apache/jackrabbit-oak/blob/1621b9d56434ee4a6f2cd19863f94d963d68ac91/oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/MultithreadedTraverseWithSortStrategy.java#L326]
*Existing data dump dir (to resume from where previous download stopped)* -
[https://github.com/apache/jackrabbit-oak/blob/1621b9d56434ee4a6f2cd19863f94d963d68ac91/oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/IndexOptions.java#L106-L108]
- This option, if specified, should point to the flat file store directory
inside the indexing work dir. See the example test case:
[https://github.com/apache/jackrabbit-oak/blob/1621b9d56434ee4a6f2cd19863f94d963d68ac91/oak-run-commons/src/test/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileStoreTest.java#L175]
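
To illustrate how a thread pool size and an existing data dump can work together, here is a minimal sketch of a resumable parallel chunk download. It is not the Oak implementation: the class `ResumableChunkDownload`, the `Chunk` type, the chunking by position range, and the in-memory `completed` set (standing in for chunk files already present in an existing dump dir) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names come from Oak.
class ResumableChunkDownload {

    // A half-open range [from, to) of document positions (illustrative only).
    static final class Chunk {
        final int id;
        final long from;
        final long to;

        Chunk(int id, long from, long to) {
            this.id = id;
            this.from = from;
            this.to = to;
        }
    }

    // Split [0, total) into chunks of at most chunkSize entries.
    static List<Chunk> split(long total, long chunkSize) {
        List<Chunk> chunks = new ArrayList<>();
        int id = 0;
        for (long from = 0; from < total; from += chunkSize) {
            chunks.add(new Chunk(id++, from, Math.min(from + chunkSize, total)));
        }
        return chunks;
    }

    // Download every chunk whose id is not yet in `completed`, using a fixed
    // thread pool. `completed` must be thread-safe; here it stands in for the
    // chunk files found in an existing dump dir (an assumption).
    // Returns the number of chunks downloaded in this run.
    static int download(List<Chunk> chunks, Set<Integer> completed, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Chunk c : chunks) {
                if (completed.contains(c.id)) {
                    continue; // finished in a previous run, skip on resume
                }
                pending.add(pool.submit(() -> {
                    // Placeholder for the real work: fetch documents in
                    // [c.from, c.to), sort them, and write a chunk file.
                    completed.add(c.id);
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // propagate any failure from a worker
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Chunk> chunks = split(10, 3);
        Set<Integer> completed = ConcurrentHashMap.newKeySet();
        completed.add(0); // pretend chunk 0 survived an earlier, failed run
        int downloaded = download(chunks, completed, 2);
        System.out.println("chunks downloaded in this run: " + downloaded);
    }
}
```

The point of the sketch is that a completed chunk is recorded only after its work finishes, so a rerun repeats at most the chunks that were in flight when the previous run failed.
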
> MongoDB indexing: implement parallel chunk download
> ---------------------------------------------------
>
> Key: OAK-9434
> URL: https://issues.apache.org/jira/browse/OAK-9434
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: indexing
> Affects Versions: 1.38.0
> Reporter: Amrit Verma
> Assignee: Amrit Verma
> Priority: Major
>
> For large indexes, indexing takes a long time. With the MongoDB
> document store, it is currently a two-step process: download the data from
> MongoDB, then create the index from that data.
> If something fails during this process, indexing has to be restarted from
> the beginning of the download step. We should make the indexing process
> resumable from the point where it stopped.
> Since downloading the data from MongoDB seems to take more time than
> indexing itself, we focus on the download part first.
> This Jira issue is for implementing resumable/parallel download.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)