These JIRA issues contain results against two ann-benchmarks datasets. It'd be great to get your thoughts/feedback if you have any:
* Searching: https://issues.apache.org/jira/browse/LUCENE-9937
* Indexing: https://issues.apache.org/jira/browse/LUCENE-9941

The benchmarks are based on the setup here: https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you run into issues with it.

A note: my motivation for running ann-benchmarks was to understand how the current performance compares to other approaches, and to research ideas for improvements. The setup in the PR doesn't feel solid/maintainable as a long term approach to development benchmarks. My personal plan is to focus on enhancing luceneutil and our nightly benchmarks (https://github.com/mikemccand/luceneutil) instead of putting a lot of effort into the ann-benchmarks setup.

Julie

On Wed, May 26, 2021 at 1:04 PM Alex K <aklib...@gmail.com> wrote:

> Thanks Michael. IIRC, the thing that was taking so long was merging into a
> single segment. Is there already benchmarking code for HNSW
> available somewhere? I feel like I remember someone posting benchmarking
> results on one of the Jira tickets.
>
> Thanks,
> Alex
>
> On Wed, May 26, 2021 at 3:41 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
> > This java implementation will be slower than the C implementation. I
> > believe the algorithm is essentially the same, however this is new and
> > there may be bugs! I (and I think Julie had similar results IIRC)
> > measured something like 8x slower than hnswlib (using ann-benchmarks).
> > It is also surprising (to me) though how this varies with
> > differently-learned vectors so YMMV. I still think there is value
> > here, and look forward to improved performance, especially as JDK16
> > has some improved support for vectorized instructions.
> >
> > Please also understand that the HNSW algorithm interacts with Lucene's
> > segmented architecture in a tricky way. Because we built a graph
> > *per-segment* when flushing/merging, these must be rebuilt whenever
> > segments are merged. So your indexing performance can be heavily
> > influenced by how often you flush, as well as by your merge policy
> > settings.
> > Also, when searching, there is a bigger than usual benefit
> > for searching across fewer segments, since the cost of searching an
> > HNSW graph scales more or less with log N (so searching a single large
> > graph is cheaper than searching the same documents divided among
> > smaller graphs). So I do recommend using a multithreaded collector in
> > order to get best latency with HNSW-based search. To get the best
> > indexing, and searching, performance, you should generally index as
> > large a number of documents as possible before flushing.
> >
> > -Mike
> >
> > On Wed, May 26, 2021 at 9:43 AM Michael Wechner
> > <michael.wech...@wyona.com> wrote:
> > >
> > > Hi Alex
> > >
> > > Thank you very much for your feedback and the various insights!
> > >
> > > On 26.05.21 at 04:41, Alex K wrote:
> > > > Hi Michael and others,
> > > >
> > > > Sorry just now getting back to you. For your three original questions:
> > > >
> > > > - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> > > > thorough response.
> > > > - As far as I know Opendistro is calling out to a C/C++ binary to run the
> > > > actual HNSW algorithm and store the HNSW part of the index. When they
> > > > implemented it about a year ago, Lucene did not have this yet. I assume the
> > > > Lucene HNSW implementation is solid, but would not be surprised if it's
> > > > slower than the C/C++ based implementation, given the JVM has some
> > > > disadvantages for these kinds of CPU-bound/number crunching algos.
> > > > - I just haven't had much time to invest into my benchmark recently. In
> > > > particular, I got stuck on why indexing was taking extremely long. Just
> > > > indexing the vectors would have easily exceeded the current time
> > > > limitations in the ANN-benchmarks project. Maybe I had some naive mistake
> > > > in my implementation, but I profiled and dug pretty deep to make it fast.
> > >
> > > I am trying to get Julie's branch running
> > >
> > > https://github.com/jtibshirani/lucene/tree/hnsw-bench
> > >
> > > Maybe this will help and is comparable
> > >
> > > > I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
> > >
> > > Yes, for more simple setups I would like to use Lucene standalone, but
> > > for setups which have to scale I would use either Elasticsearch or Solr.
> > >
> > > Thanks
> > >
> > > Michael
> > >
> > > > If so, another option you might try for ANN is the elastiknn-models
> > > > and elastiknn-lucene packages. elastiknn-models contains the Locality
> > > > Sensitive Hashing implementations of ANN used by Elastiknn, and
> > > > elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
> > > > query is the MatchHashesAndScoreQuery
> > > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
> > > > There are a couple of scala test suites that show how to use it:
> > > > MatchHashesAndScoreQuerySuite
> > > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
> > > > MatchHashesAndScoreQueryPerformanceSuite
> > > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
> > > > This is all designed to work independently from Elasticsearch and is
> > > > published on Maven: com.klibisz.elastiknn / lucene
> > > > <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
> > > > and
> > > > com.klibisz.elastiknn / models
> > > > <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
> > > > The tests are Scala but all of the implementation is in Java.
> > > >
> > > > Thanks,
> > > > Alex
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
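
For anyone skimming the thread: Mike's point about segment count can be illustrated with a toy cost model. This is only a sketch under his stated assumption that searching one HNSW graph costs roughly log N. The class and method names below are hypothetical and are not part of Lucene, and the numbers it prints are modeled costs, not measurements. It also ignores constant factors, per-segment overhead, and the multithreaded collector he recommends, which spreads the per-segment work across threads.

```java
/**
 * Toy cost model (hypothetical, not Lucene code): compares searching one
 * large HNSW graph against searching the same documents split across
 * several smaller per-segment graphs, assuming cost ~ log2(N) per graph.
 */
public class SegmentCostSketch {

    /** Modeled cost of searching a single graph holding n documents. */
    public static double graphCost(long n) {
        // A graph with zero or one document is treated as free to search.
        return n <= 1 ? 0.0 : Math.log(n) / Math.log(2);
    }

    /**
     * Modeled cost of searching totalDocs documents split evenly across
     * the given number of segments, visiting the segments one after another.
     */
    public static double totalCost(long totalDocs, int segments) {
        if (segments < 1) {
            throw new IllegalArgumentException("segments must be >= 1");
        }
        long perSegment = totalDocs / segments; // remainder ignored in this sketch
        return segments * graphCost(perSegment);
    }

    public static void main(String[] args) {
        long docs = 1_000_000L;
        for (int segments : new int[] {1, 10, 100}) {
            System.out.printf("%d segment(s): modeled cost %.1f%n",
                    segments, totalCost(docs, segments));
        }
    }
}
```

Under this model the total cost grows almost linearly with the number of segments, because each smaller graph is only slightly cheaper to search than the single large one. That is the reasoning behind indexing as many documents as possible before flushing.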