That's great! We were discussing exactly this here: https://github.com/apache/lucene/pull/12169
It would also help with the new token filter :)

--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>

On Thu, 27 Apr 2023 at 19:29, Jonathan Ellis <jbel...@gmail.com> wrote:

> Hi all,
>
> I've created an HNSW index implementation that allows for concurrent build
> and querying. On my i9-12900 (8 performance cores and 8 efficiency cores) I
> get a bit less than 10x speedup of wall clock time for building and querying
> the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/.
> The small dataset is 10k vectors while the large is 1M. This speedup feels
> pretty good for a data structure that isn't completely parallelizable, and
> it's good to see that it's consistent as the dataset gets larger.
>
> The concurrent classes achieve identical recall compared to the
> non-concurrent versions within my ability to test it, and are drop-in
> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
> to work around the places where the existing API assumes no concurrency.
>
> The concurrent classes also pass the existing test suite with the
> exception of the RAM usage ones; the estimator doesn't know about
> AtomicReference etc. (Big thanks to Michael Sokolov for testAknnDiverse,
> which made it much easier to track down subtle problems!)
>
> My motivation is
>
> 1. It is faster to query a single on-heap HNSW index than to query
> multiple such indexes and combine the results.
> 2. Even with some contention necessarily occurring during building of the
> index, we still come out way ahead in terms of total efficiency vs creating
> per-thread indexes and combining them. Since combining such indexes boils
> down to "pick the largest and then add all the other nodes normally," you
> don't really benefit from having computed the others previously.
>
> I am currently adding this to Cassandra as code in our repo, but my
> preference would be to upstream it. Is Lucene open to a pull request?
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced