[
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568756#comment-17568756
]
Michael McCandless commented on LUCENE-10216:
---------------------------------------------
I think we can backport to 9.x? But we should not rush it for 9.3. It has
baked for quite a while in {{main}} and [~vigyas] has fixed some followon build
failures.
> Add concurrency to addIndexes(CodecReader…) API
> -----------------------------------------------
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Vigya Sharma
> Priority: Major
> Time Spent: 11h 10m
> Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the
> e-commerce platform. I’m working on a project that involves applying
> metadata+ETL transforms and indexing documents on n different _indexing_
> boxes, combining them into a single index on a separate _reducer_ box, and
> making it available for queries on m different _search_ boxes (replicas).
> Segments are asynchronously copied from indexers to reducers to searchers as
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the
> reducer boxes. Since we also have taxonomy data, we need to remap facet field
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments
> with new ordinal values while also merging all provided segments in the
> process.
> _This is however a blocking call that runs in a single thread._ Until we have
> written segments with new ordinal values, we cannot copy them to searcher
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges,
> each with only a single reader, creating a concurrently running 1:1
> conversion from old segments to new ones (with new ordinal values). We follow
> this up with non-blocking background merges. This lets us copy the segments
> to searchers and replicas as soon as they are available, and later replace
> them with merged segments as background jobs complete. On the Amazon dataset
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time.
> Each call was given about 5 readers to add on average.
> This might be useful add to Lucene. We could create another {{addIndexes()}}
> API with a {{boolean}} flag for concurrency, that internally submits multiple
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}},
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your thread pool, starting
> multiple addIndexes() calls and waiting for them to complete, I felt it needs
> some understanding of what addIndexes does, why you need to wait on the merge
> and why it makes sense to pass a single reader in the addIndexes API.
> Out of box support in Lucene could simplify this for folks a similar use case.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]