From all I have seen when hooking up JFR while indexing a moderate number of vectors (1M+), almost all the time is spent simply comparing the vectors (e.g. dot product).
This indicates to me that another algorithm won't really help index build time tremendously, unless it does dramatically fewer vector comparisons (from what I can tell, this is at least not true for DiskANN, unless some fancy footwork is done when building the PQ codebook).

I would also say that comparing vector index build time to indexing terms is apples and oranges. Yes, they both live in Lucene, but the number of calculations required (no matter the data structure used) will be orders of magnitude greater.

On Fri, Apr 7, 2023, 4:59 PM Robert Muir <rcm...@gmail.com> wrote:

> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
>
> My scale concerns are both space and time. What does the execution
> time look like if you don't set an insanely large IW rambuffer? The
> default is 16MB. Just concerned we're shoving some problems under the
> rug :)
>
> Even with the yuge RAM buffer, we're still talking about almost 2 hours
> to index 4M documents with these 2k vectors. Whereas you'd measure
> this in seconds with typical Lucene indexing; it's nothing.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
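[Editorial illustration, not part of the thread.] A rough back-of-envelope sketch of the claim above: the scalar dot product that dominates the JFR profiles, plus a hypothetical estimate of how many such comparisons a graph build can trigger. The class name and the per-insert comparison count are invented for illustration; real counts depend on the graph parameters (e.g. beam width and max connections).

```java
// Hypothetical sketch (not Lucene code). Shows why vector comparisons
// swamp everything else when building a graph-based vector index.
public class DotProductCost {

    // The hot loop that shows up in profiles: one multiply + one add per dimension.
    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int dims = 1024;
        long numVectors = 8_000_000L;
        // Assumed value for illustration: comparisons per inserted vector,
        // roughly (candidates explored) x (levels), depends on graph settings.
        long comparisonsPerInsert = 1_000L;
        long totalComparisons = numVectors * comparisonsPerInsert;
        long flops = totalComparisons * 2L * dims; // 2 FLOPs per dimension
        System.out.println("total comparisons ~ " + totalComparisons);
        System.out.println("total FLOPs ~ " + flops);
    }
}
```

Under these assumed numbers the build performs on the order of 10^13 floating-point operations, which is why swapping the graph algorithm matters far less than reducing (or accelerating) the comparisons themselves.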