After some more reading, the NoMergePolicy seems to mostly solve my problem.

I've configured my IndexWriterConfig with:

    .setMaxBufferedDocs(Integer.MAX_VALUE)
    .setRAMBufferSizeMB(Double.MAX_VALUE)
    .setMergePolicy(NoMergePolicy.INSTANCE)

With this config I consistently end up with a number of segments that is a
multiple of the number of processors on the indexing VM. I don't have to
force merge at all. This also makes the indexing job faster overall.
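
For anyone reproducing this, here is a minimal sketch of the setup above, assuming Lucene 8 or later (lucene-core) is on the classpath. The class name `NoMergeIndexing`, the two helper names, and the `indexPath` parameter are made up for illustration; only the three setters come from the config above. The second helper shows one way to check the resulting segment count after the writer has been closed, by opening a reader and counting its leaves.

```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical helper class; requires lucene-core on the classpath.
final class NoMergeIndexing {

  // Same three settings as in the message above: no doc-count flush trigger,
  // an effectively unlimited shared RAM buffer, and no merging of segments.
  static IndexWriterConfig newConfig() {
    return new IndexWriterConfig() // no-arg constructor uses StandardAnalyzer
        .setMaxBufferedDocs(Integer.MAX_VALUE)
        .setRAMBufferSizeMB(Double.MAX_VALUE)
        .setMergePolicy(NoMergePolicy.INSTANCE);
  }

  // Counts the segments of a committed index: each reader leaf is one segment.
  static int countSegments(Path indexPath) throws IOException {
    try (Directory dir = FSDirectory.open(indexPath);
        DirectoryReader reader = DirectoryReader.open(dir)) {
      return reader.leaves().size();
    }
  }
}
```

Note that IndexWriterConfig also has a per-thread hard RAM limit (setRAMPerThreadHardLimitMB). A thread that exceeds it is flushed regardless of the settings above, which may be why the segment count comes out as a multiple of the thread count rather than exactly equal to it.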

I think I was previously confused by the behavior of the
ConcurrentMergeScheduler. I'm sure it's great for most use-cases, but I
really need to just move as many docs as possible as fast as possible to a
predictable number of segments, so the NoMergePolicy seems to be a good
choice for my use-case.
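
Because the document count is known before indexing starts, the share of documents each indexing thread (or each separate index) receives can be worked out up front. The sketch below shows only that arithmetic, using the ~1 billion documents and 64 search threads from the original question. `ShardPlan` and its methods are hypothetical names, and an even split of documents does not by itself guarantee exactly one segment per thread.

```java
// Hypothetical helper: splits totalDocs into contiguous, near-equal ranges.
final class ShardPlan {

  // First document index (inclusive) of the given shard. The first
  // (totalDocs % shards) shards each receive one extra document.
  static long start(long totalDocs, int shards, int shard) {
    long base = totalDocs / shards;
    long extra = totalDocs % shards;
    return shard * base + Math.min(shard, extra);
  }

  // Last document index (exclusive) of the given shard.
  static long end(long totalDocs, int shards, int shard) {
    return start(totalDocs, shards, shard + 1);
  }

  public static void main(String[] args) {
    long totalDocs = 1_000_000_000L; // approximate figure from the thread
    int shards = 64;                 // one per search thread
    for (int s = 0; s < shards; s++) {
      System.out.println(
          "shard " + s + ": [" + start(totalDocs, shards, s) + ", "
              + end(totalDocs, shards, s) + ")");
    }
  }
}
```

With these numbers every shard holds 15,625,000 documents, since 1,000,000,000 divides evenly by 64.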

Also, I learned a lot from Uwe's recent talk at Berlin Buzzwords
<https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf>,
and his great post about MMapDirectory from a few years ago
<https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html>.
Definitely recommended for others.

Thanks,
Alex

On Mon, Jul 5, 2021 at 1:53 PM Alex K <aklib...@gmail.com> wrote:

> Ok, so it sounds like if you want a very specific number of segments you
> have to do a forceMerge at some point?
>
> Is there some simple summary on how segments are formed in the first
> place? Something like, "one segment is created every time you flush from an
> IndexWriter"? Based on some experimenting and reading the code, it seems to
> be quite complicated, especially once you start calling addDocument from
> several threads in parallel.
>
> It's good to learn about the MultiReader. I'll look into that some more.
>
> Thanks,
> Alex
>
> On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
>> If you want an exact number of segments, create 64 indexes, each
>> forceMerged to one segment.
>> After that use MultiReader to create a view on all separate indexes.
>> MultiReader's contents are always flattened to a list of those 64 indexes.
>>
>> But keep in mind that this should only ever be done with *static*
>> indexes. As soon as you have updates, forceMerge in general is a bad
>> idea, and so is splitting indexes like this. Parallelization should
>> normally come from multiple queries running in parallel; you shouldn't
>> force Lucene to run a single query over so many indexes.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Alex K <aklib...@gmail.com>
>> > Sent: Monday, July 5, 2021 4:04 AM
>> > To: java-user@lucene.apache.org
>> > Subject: Control the number of segments without using forceMerge.
>> >
>> > Hi all,
>> >
>> > I'm trying to figure out if there is a way to control the number of
>> > segments in an index without explicitly calling forceMerge.
>> >
>> > My use-case looks like this: I need to index a static dataset of ~1
>> > billion documents. I know the exact number of docs before indexing
>> > starts. I know the VM where this index is searched has 64 threads.
>> > I'd like to end up with exactly 64 segments, so I can search them in
>> > a parallelized fashion.
>> >
>> > I know that I could call forceMerge(64), but this takes an extremely
>> > long time.
>> >
>> > Is there a straightforward way to ensure that I end up with 64
>> > segments without force-merging after adding all of the documents?
>> >
>> > Thanks in advance for any tips
>> >
>> > Alex Klibisz
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
