Thank you very much for running these benchmarks!
If I understand correctly, the results can be summarized as follows:
- Indexing:
Lucene is very roughly 10x slower than hnswlib/C++
- Searching (queries per second):
Lucene is very roughly 8x slower than hnswlib/C++
Is that right? Should we double-check these results?
Also, it is not yet clear what causes this performance difference,
right?
On 27.05.21 at 03:33, Julie Tibshirani wrote:
These JIRA issues contain results against two ann-benchmarks datasets. It'd
be great to get your thoughts/feedback if you have any:
* Searching: https://issues.apache.org/jira/browse/LUCENE-9937
* Indexing: https://issues.apache.org/jira/browse/LUCENE-9941
The benchmarks are based on the setup here:
https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you run
into issues with it.
A note: my motivation for running ann-benchmarks was to understand how the
current performance compares to other approaches, and to research ideas for
improvements. The setup in the PR doesn't feel solid/maintainable as a
long term approach to development benchmarks. My personal plan is to focus
on enhancing luceneutil and our nightly benchmarks (
https://github.com/mikemccand/luceneutil) instead of putting a lot of
effort into the ann-benchmarks setup.
Julie
On Wed, May 26, 2021 at 1:04 PM Alex K <aklib...@gmail.com> wrote:
Thanks Michael. IIRC, the thing that was taking so long was merging into a
single segment. Is there already benchmarking code for HNSW
available somewhere? I feel like I remember someone posting benchmarking
results on one of the Jira tickets.
Thanks,
Alex
On Wed, May 26, 2021 at 3:41 PM Michael Sokolov <msoko...@gmail.com>
wrote:
This Java implementation will be slower than the C implementation. I
believe the algorithm is essentially the same, however this is new and
there may be bugs! I (and I think Julie had similar results IIRC)
measured something like 8x slower than hnswlib (using ann-benchmarks).
It is also surprising (to me) though how this varies with
differently-learned vectors so YMMV. I still think there is value
here, and look forward to improved performance, especially as JDK16
has some improved support for vectorized instructions.
Please also understand that the HNSW algorithm interacts with Lucene's
segmented architecture in a tricky way. Because we build a graph
*per segment* when flushing/merging, the graphs must be rebuilt whenever
segments are merged. So your indexing performance can be heavily
influenced by how often you flush, as well as by your merge policy
settings. Also, when searching, there is a bigger than usual benefit
for searching across fewer segments, since the cost of searching an
HNSW graph scales more or less with log N (so searching a single large
graph is cheaper than searching the same documents divided among
smaller graphs). So I do recommend using a multithreaded collector in
order to get best latency with HNSW-based search. To get the best
indexing and search performance, you should generally index as
large a number of documents as possible before flushing.
-Mike
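As a rough illustration of the two effects described above, here is a toy cost model in Python. It is only a sketch under stated assumptions, not a description of how Lucene actually schedules flushes or merges: the function names, the equal-sized flushes, the level-by-level merging with a fixed `merge_factor`, and the use of log2 as the per-graph search cost are all hypothetical simplifications chosen for the example.

```python
import math


def search_cost(num_docs: int, num_segments: int) -> float:
    """Toy search cost when segments are searched one after another.

    Assumption: searching one HNSW graph costs about log2(graph size),
    and documents are spread evenly across the segments.
    """
    docs_per_segment = num_docs / num_segments
    return num_segments * math.log2(docs_per_segment)


def graph_insertions(num_docs: int, docs_per_flush: int, merge_factor: int) -> int:
    """Toy count of graph node insertions during indexing.

    Assumptions: every flush writes docs_per_flush documents, segments are
    merged merge_factor at a time, level by level, until one segment remains,
    and every merge rebuilds the graph for all documents it touches.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    segments = math.ceil(num_docs / docs_per_flush)
    total = num_docs  # initial graph build at flush time
    while segments > 1:
        segments = math.ceil(segments / merge_factor)
        total += num_docs  # each merge level re-inserts every document
    return total


if __name__ == "__main__":
    n = 1_000_000
    print("search cost, 1 segment:  ", search_cost(n, 1))
    print("search cost, 10 segments:", search_cost(n, 10))
    print("insertions, small flushes:", graph_insertions(n, 10_000, 10))
    print("insertions, one big flush:", graph_insertions(n, n, 10))
```

In this model, splitting the same documents across more segments raises the sequential search cost, and smaller flushes multiply the number of graph insertions because each merge level rebuilds the graphs. Searching segments concurrently would reduce latency toward the cost of a single segment, although the total work stays the same.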
On Wed, May 26, 2021 at 9:43 AM Michael Wechner
<michael.wech...@wyona.com> wrote:
Hi Alex
Thank you very much for your feedback and the various insights!
On 26.05.21 at 04:41, Alex K wrote:
Hi Michael and others,
Sorry just now getting back to you. For your three original questions:
- Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
thorough response.
- As far as I know Opendistro is calling out to a C/C++ binary to run the
actual HNSW algorithm and store the HNSW part of the index. When they
implemented it about a year ago, Lucene did not have this yet. I assume the
Lucene HNSW implementation is solid, but would not be surprised if it's
slower than the C/C++ based implementation, given the JVM has some
disadvantages for these kinds of CPU-bound/number crunching algos.
- I just haven't had much time to invest into my benchmark recently. In
particular, I got stuck on why indexing was taking extremely long. Just
indexing the vectors would have easily exceeded the current time
limitations in the ANN-benchmarks project. Maybe I had some naive mistake
in my implementation, but I profiled and dug pretty deep to make it fast.
I am trying to get Julie's branch running
https://github.com/jtibshirani/lucene/tree/hnsw-bench
Maybe this will help and is comparable
I'm assuming you want to use Lucene, but not necessarily via
Elasticsearch?
Yes, for more simple setups I would like to use Lucene standalone, but
for setups which have to scale I would use either Elasticsearch or
Solr.
Thanks
Michael
If so, another option you might try for ANN is the elastiknn-models
and elastiknn-lucene packages. elastiknn-models contains the Locality
Sensitive Hashing implementations of ANN used by Elastiknn, and
elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
query is the MatchHashesAndScoreQuery
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
There are a couple of Scala test suites that show how to use it:
MatchHashesAndScoreQuerySuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
MatchHashesAndScoreQueryPerformanceSuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
This is all designed to work independently from Elasticsearch and is
published on Maven: com.klibisz.elastiknn / lucene
<https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
and com.klibisz.elastiknn / models
<https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
The tests are Scala but all of the implementation is in Java.
Thanks,
Alex
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org