On Fri, Apr 7, 2023 at 5:13 PM Benjamin Trent <ben.w.tr...@gmail.com> wrote: > > From all I have seen when hooking up JFR when indexing a medium number of > vectors(1M +), almost all the time is spent simply comparing the vectors > (e.g. dot_product). > > This indicates to me that another algorithm won't really help index build > time tremendously. Unless others do dramatically fewer vector comparisons > (from what I can tell, this is at least not true for DiskAnn, unless some > fancy footwork is done when building the PQ codebook). > > I would also say comparing vector index build time to indexing terms are > apples and oranges. Yeah, they both live in Lucene, but the number of > calculations required (no matter the data structure used), will be magnitudes > greater. >
I'm not sure, i think this slowness due to massive number of comparisons is just another side effect of the unscalable algorithm. It is designed to build an in-memory datastructure and "merge" means "rebuild". And since we fully rebuild a new one when merging, you get something like O(n^2) total indexing when you take merges into account. Some of the other algorithms... in fact support merging. The DiskANN paper has like a "chapter" on this. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org