Re: Lucene/Solr and BERT

Michael Wechner Thu, 27 May 2021 00:55:26 -0700

Thank you very much for having done these benchmarks!


IIUC one could state

- Indexing:

Lucene is slower than hnswlib/C++, very roughly a 10x performance difference

- Searching (Queries per second):

Lucene is slower than hnswlib/C++, very roughly an 8x performance difference


right, but we should double-check these results?

Also, it is not clear at the moment why there is this performance difference, right?



On 27.05.21 at 03:33, Julie Tibshirani wrote:

These JIRA issues contain results against two ann-benchmarks datasets. It'd
be great to get your thoughts/feedback if you have any:
* Searching: https://issues.apache.org/jira/browse/LUCENE-9937
* Indexing: https://issues.apache.org/jira/browse/LUCENE-9941

The benchmarks are based on the setup here:
https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you run
into issues with it.

A note: my motivation for running ann-benchmarks was to understand how the
current performance compares to other approaches, and to research ideas for
improvements. The setup in the PR doesn't feel solid/maintainable as a
long term approach to development benchmarks. My personal plan is to focus
on enhancing luceneutil and our nightly benchmarks (
https://github.com/mikemccand/luceneutil) instead of putting a lot of
effort into the ann-benchmarks setup.

Julie

On Wed, May 26, 2021 at 1:04 PM Alex K <[email protected]> wrote:

Thanks Michael. IIRC, the thing that was taking so long was merging into a
single segment. Is there already benchmarking code for HNSW
available somewhere? I feel like I remember someone posting benchmarking
results on one of the Jira tickets.

Thanks,
Alex

On Wed, May 26, 2021 at 3:41 PM Michael Sokolov <[email protected]>
wrote:

This Java implementation will be slower than the C implementation. I
believe the algorithm is essentially the same, however this is new and
there may be bugs! I (and I think Julie had similar results IIRC)
measured something like 8x slower than hnswlib (using ann-benchmarks).
It is also surprising (to me) though how this varies with
differently-learned vectors so YMMV. I still think there is value
here, and look forward to improved performance, especially as JDK16
has some improved support for vectorized instructions.

Please also understand that the HNSW algorithm interacts with Lucene's
segmented architecture in a tricky way. Because we build a graph
*per-segment* when flushing/merging, these graphs must be rebuilt whenever
segments are merged. So your indexing performance can be heavily
influenced by how often you flush, as well as by your merge policy
settings. Also, when searching, there is a bigger than usual benefit
for searching across fewer segments, since the cost of searching an
HNSW graph scales more or less with log N (so searching a single large
graph is cheaper than searching the same documents divided among
smaller graphs). So I do recommend using a multithreaded collector in
order to get best latency with HNSW-based search. To get the best
indexing, and searching, performance, you should generally index as
large a number of documents as possible before flushing.

-Mike
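
To make the log N point above concrete, here is a toy cost model in Python. It assumes, purely for illustration, that one graph search costs about log2(N) units, that documents are split evenly across segments, and that segments are searched one after another. The function names and the document count are made up for this sketch, and nothing here is a measurement of Lucene.

```python
import math


def hnsw_search_cost(num_docs: int) -> float:
    """Toy model (an assumption): one HNSW graph search costs ~log2(N) units."""
    return math.log2(num_docs) if num_docs > 1 else 1.0


def total_search_cost(num_docs: int, num_segments: int) -> float:
    """Cost of searching every segment sequentially, documents split evenly."""
    docs_per_segment = num_docs // num_segments
    return num_segments * hnsw_search_cost(docs_per_segment)


if __name__ == "__main__":
    n = 1 << 20  # 1,048,576 documents, an illustrative number only
    for segments in (1, 4, 16):
        # prints: 1 20.0, then 4 72.0, then 16 256.0
        print(segments, total_search_cost(n, segments))
```

Under these assumptions a single graph over all documents costs 20 units, while the same documents spread over 16 segments cost 256 units when searched sequentially. That is the intuition behind preferring fewer, larger segments, or searching segments in parallel to hide the extra work.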

On Wed, May 26, 2021 at 9:43 AM Michael Wechner
<[email protected]> wrote:

Hi Alex

Thank you very much for your feedback and the various insights!

On 26.05.21 at 04:41, Alex K wrote:

Hi Michael and others,

Sorry just now getting back to you. For your three original questions:

- Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
thorough response.
- As far as I know Opendistro is calling out to a C/C++ binary to run the
actual HNSW algorithm and store the HNSW part of the index. When they
implemented it about a year ago, Lucene did not have this yet. I assume the
Lucene HNSW implementation is solid, but would not be surprised if it's
slower than the C/C++ based implementation, given the JVM has some
disadvantages for these kinds of CPU-bound/number crunching algos.
- I just haven't had much time to invest into my benchmark recently. In
particular, I got stuck on why indexing was taking extremely long. Just
indexing the vectors would have easily exceeded the current time
limitations in the ANN-benchmarks project. Maybe I had some naive mistake
in my implementation, but I profiled and dug pretty deep to make it fast.

I am trying to get Julie's branch running

https://github.com/jtibshirani/lucene/tree/hnsw-bench

Maybe this will help and is comparable

I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?

Yes, for more simple setups I would like to use Lucene standalone, but
for setups which have to scale I would use either Elasticsearch or Solr.

Thanks

Michael

If so, another option you might try for ANN is the elastiknn-models
and elastiknn-lucene packages. elastiknn-models contains the Locality
Sensitive Hashing implementations of ANN used by Elastiknn, and
elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
query is the MatchHashesAndScoreQuery
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>

There are a couple of Scala test suites that show how to use it:
MatchHashesAndScoreQuerySuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>
and MatchHashesAndScoreQueryPerformanceSuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>

This is all designed to work independently from Elasticsearch and is
published on Maven: com.klibisz.elastiknn / lucene
<https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
and com.klibisz.elastiknn / models
<https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>

The tests are Scala but all of the implementation is in Java.

Thanks,
Alex


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

