I don't know about addIndexes. Does that let you say which document goes where somehow? Wouldn't you have to select a subset of documents from each originally indexed segment?
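
To make that concern concrete, here is a rough sketch in plain Java. It makes no Lucene calls, and every name in it (`SegmentPlan`, `plan`, the string doc keys) is made up for illustration. It assumes the dumped doc-segment map is available as a map from a doc key to a target segment number, and that we know which doc keys ended up in each concurrently built segment. It only computes which docs each target segment would need to pull from each source segment; how that selection would actually be handed to the index is the open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
final class SegmentPlan {

    /**
     * sourceSegments: for each segment produced by the concurrent build,
     *                 the doc keys it contains, in segment order.
     * targetOf:       the dumped doc-segment map (doc key -> target segment).
     * Returns target segment -> (source segment -> doc keys to take from it).
     * Sorted maps keep the iteration order deterministic.
     */
    static SortedMap<Integer, SortedMap<Integer, List<String>>> plan(
            List<List<String>> sourceSegments, Map<String, Integer> targetOf) {
        SortedMap<Integer, SortedMap<Integer, List<String>>> plan = new TreeMap<>();
        for (int src = 0; src < sourceSegments.size(); src++) {
            for (String key : sourceSegments.get(src)) {
                Integer target = targetOf.get(key);
                if (target == null) {
                    throw new IllegalArgumentException("no target segment for doc " + key);
                }
                plan.computeIfAbsent(target, t -> new TreeMap<>())
                    .computeIfAbsent(src, s -> new ArrayList<>())
                    .add(key);
            }
        }
        return plan;
    }
}
```

With two source segments holding (a, c) and (b, d), and a map that sends a and b to target 0 and c and d to target 1, each target ends up needing one doc from each source segment. That is the "subset of documents from each originally indexed segment" I am asking about.
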

On Sat, Dec 19, 2020, 12:11 PM Michael Sokolov <[email protected]> wrote:

> I think the idea is to exert control over the distribution of documents
> among the segments, in a deterministic, reproducible way.
>
> On Sat, Dec 19, 2020, 11:39 AM Adrien Grand <[email protected]> wrote:
>
>> Have you considered leveraging Lucene's built-in index sorting? It
>> supports concurrent indexing and is quite fast.
>>
>> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai <[email protected]> wrote:
>>
>>> Hi
>>> Our team is seeking a way to construct (or rebuild) a deterministic
>>> sorted index concurrently (I know Lucene can achieve that in a sequential
>>> manner, but that might be too slow for us sometimes).
>>> Currently we have roughly 2 ideas, both assuming there's a pre-built
>>> index and that we have dumped a doc-segment map so that IndexWriter is
>>> aware of which doc belongs to which segment:
>>> 1. First build the index in the normal way (concurrently); after the index
>>> is built, use the "addIndexes" functionality to merge documents into the
>>> correct segment.
>>> 2. By controlling FlushPolicy and other related classes, make sure each
>>> segment created (before merge) has only the documents that belong to one of
>>> the segments in the pre-built index. Then create a dedicated MergePolicy to
>>> only merge segments belonging to one pre-built segment.
>>>
>>> Basically we think the first one is easier to implement and the second one
>>> is faster. We would like some ideas, suggestions, and feedback here.
>>>
>>> Thanks
>>> Patrick Zhai
>>>
>>
>>
>> --
>> Adrien
>>
>
