Your summary sounds right to me. There are some ideas (being discussed on the issue), but I don't think we have a detailed understanding yet of the performance difference.
It would be great to get more eyes on the benchmark if you're interested in double-checking the results. Mike mentioned that he saw a similar performance difference in search (7-8x) when he ran his own benchmarks.

Julie

On Thu, May 27, 2021 at 12:55 AM Michael Wechner <michael.wech...@wyona.com> wrote:

> Thank you very much for having done these benchmarks!
>
> IIUC one could state
>
> - Indexing:
>   Lucene is slower than hnswlib/C++, very roughly 10x performance
>   difference
> - Searching (Queries per second):
>   Lucene is slower than hnswlib/C++, very roughly 8x performance
>   difference
>
> right, but we should double-check these results?
>
> Also it is not clear at the moment why there is this performance
> difference, right?
>
>
> On 27.05.21 at 03:33, Julie Tibshirani wrote:
> > These JIRA issues contain results against two ann-benchmarks datasets. It'd
> > be great to get your thoughts/feedback if you have any:
> > * Searching: https://issues.apache.org/jira/browse/LUCENE-9937
> > * Indexing: https://issues.apache.org/jira/browse/LUCENE-9941
> >
> > The benchmarks are based on the setup here:
> > https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you run
> > into issues with it.
> >
> > A note: my motivation for running ann-benchmarks was to understand how the
> > current performance compares to other approaches, and to research ideas for
> > improvements. The setup in the PR doesn't feel solid/maintainable as a
> > long-term approach to development benchmarks. My personal plan is to focus
> > on enhancing luceneutil and our nightly benchmarks
> > (https://github.com/mikemccand/luceneutil) instead of putting a lot of
> > effort into the ann-benchmarks setup.
> >
> > Julie
> >
> > On Wed, May 26, 2021 at 1:04 PM Alex K <aklib...@gmail.com> wrote:
> >
> >> Thanks Michael. IIRC, the thing that was taking so long was merging into a
> >> single segment. Is there already benchmarking code for HNSW
> >> available somewhere?
> >> I feel like I remember someone posting benchmarking
> >> results on one of the Jira tickets.
> >>
> >> Thanks,
> >> Alex
> >>
> >> On Wed, May 26, 2021 at 3:41 PM Michael Sokolov <msoko...@gmail.com>
> >> wrote:
> >>
> >>> This Java implementation will be slower than the C implementation. I
> >>> believe the algorithm is essentially the same, however this is new and
> >>> there may be bugs! I (and I think Julie had similar results IIRC)
> >>> measured something like 8x slower than hnswlib (using ann-benchmarks).
> >>> It is also surprising (to me) though how this varies with
> >>> differently-learned vectors so YMMV. I still think there is value
> >>> here, and look forward to improved performance, especially as JDK16
> >>> has some improved support for vectorized instructions.
> >>>
> >>> Please also understand that the HNSW algorithm interacts with Lucene's
> >>> segmented architecture in a tricky way. Because we build a graph
> >>> *per-segment* when flushing/merging, these must be rebuilt whenever
> >>> segments are merged. So your indexing performance can be heavily
> >>> influenced by how often you flush, as well as by your merge policy
> >>> settings. Also, when searching, there is a bigger than usual benefit
> >>> for searching across fewer segments, since the cost of searching an
> >>> HNSW graph scales more or less with log N (so searching a single large
> >>> graph is cheaper than searching the same documents divided among
> >>> smaller graphs). So I do recommend using a multithreaded collector in
> >>> order to get best latency with HNSW-based search. To get the best
> >>> indexing, and searching, performance, you should generally index as
> >>> large a number of documents as possible before flushing.
> >>>
> >>> -Mike
> >>>
> >>> On Wed, May 26, 2021 at 9:43 AM Michael Wechner
> >>> <michael.wech...@wyona.com> wrote:
> >>>> Hi Alex
> >>>>
> >>>> Thank you very much for your feedback and the various insights!
> >>>>
> >>>> On 26.05.21 at 04:41, Alex K wrote:
> >>>>> Hi Michael and others,
> >>>>>
> >>>>> Sorry just now getting back to you. For your three original questions:
> >>>>> - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> >>>>> thorough response.
> >>>>> - As far as I know Opendistro is calling out to a C/C++ binary to run the
> >>>>> actual HNSW algorithm and store the HNSW part of the index. When they
> >>>>> implemented it about a year ago, Lucene did not have this yet. I assume the
> >>>>> Lucene HNSW implementation is solid, but would not be surprised if it's
> >>>>> slower than the C/C++ based implementation, given the JVM has some
> >>>>> disadvantages for these kinds of CPU-bound/number crunching algos.
> >>>>> - I just haven't had much time to invest into my benchmark recently. In
> >>>>> particular, I got stuck on why indexing was taking extremely long. Just
> >>>>> indexing the vectors would have easily exceeded the current time
> >>>>> limitations in the ANN-benchmarks project. Maybe I had some naive mistake
> >>>>> in my implementation, but I profiled and dug pretty deep to make it fast.
> >>>> I am trying to get Julie's branch running
> >>>>
> >>>> https://github.com/jtibshirani/lucene/tree/hnsw-bench
> >>>>
> >>>> Maybe this will help and is comparable
> >>>>
> >>>>
> >>>>> I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
> >>>> Yes, for simpler setups I would like to use Lucene standalone, but
> >>>> for setups which have to scale I would use either Elasticsearch or Solr.
> >>>> Thanks
> >>>>
> >>>> Michael
> >>>>
> >>>>
> >>>>
> >>>>> If so, another option you might try for ANN is the elastiknn-models
> >>>>> and elastiknn-lucene packages.
> >>>>> elastiknn-models contains the Locality
> >>>>> Sensitive Hashing implementations of ANN used by Elastiknn, and
> >>>>> elastiknn-lucene contains the Lucene queries used by Elastiknn. The
> >>>>> Lucene query is the MatchHashesAndScoreQuery
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
> >>>>> There are a couple of Scala test suites that show how to use it:
> >>>>> MatchHashesAndScoreQuerySuite
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
> >>>>> MatchHashesAndScoreQueryPerformanceSuite
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
> >>>>> This is all designed to work independently from Elasticsearch and is
> >>>>> published on Maven: com.klibisz.elastiknn / lucene
> >>>>> <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
> >>>>> and
> >>>>> com.klibisz.elastiknn / models
> >>>>> <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
> >>>>> The tests are Scala but all of the implementation is in Java.
> >>>>>
> >>>>> Thanks,
> >>>>> Alex
> >>>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
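
To make Mike's point about segment count concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the cost of searching one HNSW graph is proportional to log2 of the number of vectors in it, that vectors are split evenly across segments, and that segments are searched one after another. The function names are hypothetical, and nothing here reflects how Lucene actually measures or computes search cost.

```python
import math


def graph_search_cost(num_vectors: int) -> float:
    """Toy cost of searching one HNSW graph, ASSUMED proportional to log2(N)."""
    if num_vectors < 2:
        raise ValueError("num_vectors must be at least 2")
    return math.log2(num_vectors)


def segmented_search_cost(total_vectors: int, num_segments: int) -> float:
    """Toy cost of searching the same vectors split evenly across segments.

    Each segment holds its own graph, and the graphs are searched
    sequentially, so the per-graph costs simply add up.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    per_segment = total_vectors // num_segments
    return num_segments * graph_search_cost(per_segment)


if __name__ == "__main__":
    total = 1_000_000
    for segments in (1, 5, 10, 50):
        cost = segmented_search_cost(total, segments)
        print(f"{segments:>3} segment(s): relative cost {cost:.1f}")
```

Under this toy model the relative cost grows almost linearly with the number of segments, because log N shrinks only slowly as each graph gets smaller. That is consistent with the advice above to search fewer, larger segments or to search segments in parallel; real numbers will depend on the graph parameters and the data.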