Re: Lucene/Solr and BERT

Julie Tibshirani Thu, 27 May 2021 10:05:31 -0700

Your summary sounds right to me. There are some ideas (being discussed on
the issue), but I don't think we have a detailed understanding yet of the
performance difference.


It would be great to get more eyes on the benchmark if you're interested in
double-checking the results. Mike mentioned that he saw a similar
performance difference in search (7-8x) when he ran his own benchmarks.

Julie




On Thu, May 27, 2021 at 12:55 AM Michael Wechner <michael.wech...@wyona.com>
wrote:

> Thank you very much for having done these benchmarks!
>
> IIUC one could state
>
> - Indexing:
>        Lucene is slower than hnswlib/C++, very roughly 10x performance
>        difference
> - Searching (Queries per second):
>        Lucene is slower than hnswlib/C++, very roughly 8x performance
>        difference
>
> right, but we should double-check these results?
>
> Also it is not clear at the moment why there is this performance
> difference, right?
>
>
> On 27.05.21 at 03:33, Julie Tibshirani wrote:
> > These JIRA issues contain results against two ann-benchmarks datasets.
> > It'd be great to get your thoughts/feedback if you have any:
> > * Searching: https://issues.apache.org/jira/browse/LUCENE-9937
> > * Indexing: https://issues.apache.org/jira/browse/LUCENE-9941
> >
> > The benchmarks are based on the setup here:
> > https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you
> > run into issues with it.
> >
> > A note: my motivation for running ann-benchmarks was to understand how
> > the current performance compares to other approaches, and to research
> > ideas for improvements. The setup in the PR doesn't feel
> > solid/maintainable as a long-term approach to development benchmarks.
> > My personal plan is to focus on enhancing luceneutil and our nightly
> > benchmarks (https://github.com/mikemccand/luceneutil) instead of putting
> > a lot of effort into the ann-benchmarks setup.
> >
> > Julie
> >
> > On Wed, May 26, 2021 at 1:04 PM Alex K <aklib...@gmail.com> wrote:
> >
> >> Thanks Michael. IIRC, the thing that was taking so long was merging
> >> into a single segment. Is there already benchmarking code for HNSW
> >> available somewhere? I feel like I remember someone posting benchmarking
> >> results on one of the Jira tickets.
> >>
> >> Thanks,
> >> Alex
> >>
> >> On Wed, May 26, 2021 at 3:41 PM Michael Sokolov <msoko...@gmail.com>
> >> wrote:
> >>
> >>> This Java implementation will be slower than the C implementation. I
> >>> believe the algorithm is essentially the same, however this is new and
> >>> there may be bugs! I (and I think Julie had similar results IIRC)
> >>> measured something like 8x slower than hnswlib (using ann-benchmarks).
> >>> It is also surprising (to me) though how this varies with
> >>> differently-learned vectors so YMMV. I still think there is value
> >>> here, and look forward to improved performance, especially as JDK16
> >>> has some improved support for vectorized instructions.
> >>>
> >>> Please also understand that the HNSW algorithm interacts with Lucene's
> >>> segmented architecture in a tricky way. Because we build a graph
> >>> *per-segment* when flushing/merging, these graphs must be rebuilt whenever
> >>> segments are merged. So your indexing performance can be heavily
> >>> influenced by how often you flush, as well as by your merge policy
> >>> settings. Also, when searching, there is a bigger than usual benefit
> >>> for searching across fewer segments, since the cost of searching an
> >>> HNSW graph scales more or less with log N (so searching a single large
> >>> graph is cheaper than searching the same documents divided among
> >>> smaller graphs). So I do recommend using a multithreaded collector in
> >>> order to get best latency with HNSW-based search. To get the best
> >>> indexing, and searching, performance, you should generally index as
> >>> large a number of documents as possible before flushing.
> >>>
> >>> -Mike
> >>>
> >>> On Wed, May 26, 2021 at 9:43 AM Michael Wechner
> >>> <michael.wech...@wyona.com> wrote:
> >>>> Hi Alex
> >>>>
> >>>> Thank you very much for your feedback and the various insights!
> >>>>
> >>>> On 26.05.21 at 04:41, Alex K wrote:
> >>>>> Hi Michael and others,
> >>>>>
> >>>>> Sorry just now getting back to you. For your three original
> >>>>> questions:
> >>>>> - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> >>>>> thorough response.
> >>>>> - As far as I know Opendistro is calling out to a C/C++ binary to run
> >>>>> the actual HNSW algorithm and store the HNSW part of the index. When
> >>>>> they implemented it about a year ago, Lucene did not have this yet. I
> >>>>> assume the Lucene HNSW implementation is solid, but would not be
> >>>>> surprised if it's slower than the C/C++ based implementation, given
> >>>>> the JVM has some disadvantages for these kinds of CPU-bound/number
> >>>>> crunching algos.
> >>>>> - I just haven't had much time to invest into my benchmark recently.
> >>>>> In particular, I got stuck on why indexing was taking extremely long.
> >>>>> Just indexing the vectors would have easily exceeded the current time
> >>>>> limitations in the ANN-benchmarks project. Maybe I had some naive
> >>>>> mistake in my implementation, but I profiled and dug pretty deep to
> >>>>> make it fast.
> >>>> I am trying to get Julie's branch running
> >>>>
> >>>> https://github.com/jtibshirani/lucene/tree/hnsw-bench
> >>>>
> >>>> Maybe this will help and is comparable
> >>>>
> >>>>
> >>>>> I'm assuming you want to use Lucene, but not necessarily via
> >>>>> Elasticsearch?
> >>>> Yes, for more simple setups I would like to use Lucene standalone, but
> >>>> for setups which have to scale I would use either Elasticsearch or
> >>>> Solr.
> >>>> Thanks
> >>>>
> >>>> Michael
> >>>>
> >>>>
> >>>>
> >>>>> If so, another option you might try for ANN is the elastiknn-models
> >>>>> and elastiknn-lucene packages. elastiknn-models contains the Locality
> >>>>> Sensitive Hashing implementations of ANN used by Elastiknn, and
> >>>>> elastiknn-lucene contains the Lucene queries used by Elastiknn. The
> >>>>> Lucene query is the MatchHashesAndScoreQuery
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
> >>>>> There are a couple of Scala test suites that show how to use it:
> >>>>> MatchHashesAndScoreQuerySuite
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
> >>>>> MatchHashesAndScoreQueryPerformanceSuite
> >>>>> <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
> >>>>> This is all designed to work independently from Elasticsearch and is
> >>>>> published on Maven: com.klibisz.elastiknn / lucene
> >>>>> <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
> >>>>> and com.klibisz.elastiknn / models
> >>>>> <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
> >>>>> The tests are Scala but all of the implementation is in Java.
> >>>>>
> >>>>> Thanks,
> >>>>> Alex
> >>>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>
>
>
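
The point made earlier in the thread, that the cost of searching an HNSW
graph scales more or less with log N and so one large graph is cheaper to
search than the same documents split across several smaller graphs, can be
illustrated with a toy cost model. This is only a sketch under stated
assumptions: the function names, the unit constant, and the even split of
documents across segments are hypothetical, and nothing here calls Lucene or
measures real performance.

```python
import math


def hnsw_search_cost(num_docs: int) -> float:
    """Toy cost of searching one HNSW graph: assumed proportional to log2(N)."""
    return math.log2(num_docs) if num_docs > 1 else 1.0


def total_search_cost(total_docs: int, num_segments: int) -> float:
    """Cost of searching every segment in turn, with docs split evenly (assumption)."""
    docs_per_segment = total_docs // num_segments
    return num_segments * hnsw_search_cost(docs_per_segment)


if __name__ == "__main__":
    total = 2 ** 20  # illustrative corpus size, not taken from the benchmarks
    for segments in (1, 4, 16):
        print(f"{segments:>2} segment(s): relative cost {total_search_cost(total, segments):.1f}")
```

Under this model the summed work grows almost linearly with the number of
segments, which is consistent with the advice above to flush rarely and to
search as few segments as possible. Searching segments in parallel would
reduce latency but not the total work.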
