After some more reading, the NoMergePolicy seems to mostly solve my problem.
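
Concretely, a minimal sketch of a writer with merging disabled, assuming Lucene 8.x on the classpath. The class name, the index path, and the choice of StandardAnalyzer are placeholders for illustration only; the three settings are the ones listed right after this block.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical class name; not taken from the actual indexing job.
public class NoMergeIndexing {

  // Builds a config whose flushed segments are never merged.
  static IndexWriterConfig noMergeConfig() {
    return new IndexWriterConfig(new StandardAnalyzer()) // analyzer is an assumption
        .setMaxBufferedDocs(Integer.MAX_VALUE)   // no practical doc-count flush trigger
        .setRAMBufferSizeMB(Double.MAX_VALUE)    // no practical global RAM flush trigger
        .setMergePolicy(NoMergePolicy.INSTANCE); // keep every flushed segment as-is
  }

  public static void main(String[] args) throws IOException {
    Path indexPath = Paths.get("/tmp/no-merge-index"); // hypothetical location
    try (Directory dir = FSDirectory.open(indexPath)) {
      try (IndexWriter writer = new IndexWriter(dir, noMergeConfig())) {
        // Call writer.addDocument(...) from the indexing threads here.
      } // closing the writer commits pending changes
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        // Each leaf of the reader corresponds to one segment.
        System.out.println("segments: " + reader.leaves().size());
      }
    }
  }
}
```

Counting `reader.leaves()` after the writer is closed is one way to check how many segments the run produced.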

I've configured my IndexWriterConfig with:

    .setMaxBufferedDocs(Integer.MAX_VALUE)
    .setRAMBufferSizeMB(Double.MAX_VALUE)
    .setMergePolicy(NoMergePolicy.INSTANCE)

With this config I consistently end up with a number of segments that is a
multiple of the number of processors on the indexing VM. I don't have to
force merge at all. This also makes the indexing job faster overall.

I think I was previously confused by the behavior of the
ConcurrentMergeScheduler. I'm sure it's great for most use-cases, but I
really need to just move as many docs as possible as fast as possible to a
predictable number of segments, so the NoMergePolicy seems to be a good
choice for my use-case.

Also, I learned a lot from Uwe's recent talk at Berlin Buzzwords
<https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf>,
and his great post about MMapDirectory from a few years ago
<https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html>.
Definitely recommended for others.

Thanks,
Alex

On Mon, Jul 5, 2021 at 1:53 PM Alex K <aklib...@gmail.com> wrote:

> Ok, so it sounds like if you want a very specific number of segments you
> have to do a forceMerge at some point?
>
> Is there some simple summary on how segments are formed in the first
> place? Something like, "one segment is created every time you flush from an
> IndexWriter"? Based on some experimenting and reading the code, it seems to
> be quite complicated, especially once you start calling addDocument from
> several threads in parallel.
>
> It's good to learn about the MultiReader. I'll look into that some more.
>
> Thanks,
> Alex
>
> On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
>> If you want an exact number of segments, create 64 indexes, each
>> forceMerged to one segment.
>> After that use MultiReader to create a view on all separate indexes.
>> MultiReader's contents are always flattened to a list of those 64 indexes.
>>
>> But keep in mind that this should only ever be done with *static*
>> indexes. As soon as you have updates, forceMerge in general is a bad
>> idea, and so is splitting indexes like this. Parallelization should
>> normally come from multiple queries running in parallel, but you shouldn't
>> force Lucene to run a single query over so many indexes.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Alex K <aklib...@gmail.com>
>> > Sent: Monday, July 5, 2021 4:04 AM
>> > To: java-user@lucene.apache.org
>> > Subject: Control the number of segments without using forceMerge.
>> >
>> > Hi all,
>> >
>> > I'm trying to figure out if there is a way to control the number of
>> > segments in an index without explicitly calling forceMerge.
>> >
>> > My use-case looks like this: I need to index a static dataset of ~1
>> > billion documents. I know the exact number of docs before indexing
>> > starts. I know the VM where this index is searched has 64 threads. I'd
>> > like to end up with exactly 64 segments, so I can search them in a
>> > parallelized fashion.
>> >
>> > I know that I could call forceMerge(64), but this takes an extremely
>> > long time.
>> >
>> > Is there a straightforward way to ensure that I end up with 64 segments
>> > without force-merging after adding all of the documents?
>> >
>> > Thanks in advance for any tips
>> >
>> > Alex Klibisz