Re: Lucene/Solr and BERT

Julie Tibshirani Wed, 26 May 2021 18:34:14 -0700

These JIRA issues contain results against two ann-benchmarks datasets. It'd
be great to get your thoughts/feedback if you have any:
* Searching: https://issues.apache.org/jira/browse/LUCENE-9937
* Indexing: https://issues.apache.org/jira/browse/LUCENE-9941


The benchmarks are based on the setup here:
https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you run
into issues with it.

A note: my motivation for running ann-benchmarks was to understand how the
current performance compares to other approaches, and to research ideas for
improvements. The setup in the PR doesn't feel solid/maintainable as a
long-term approach to development benchmarks. My personal plan is to focus
on enhancing luceneutil and our nightly benchmarks
(https://github.com/mikemccand/luceneutil) instead of putting a lot of
effort into the ann-benchmarks setup.

Julie

On Wed, May 26, 2021 at 1:04 PM Alex K wrote:

> Thanks Michael. IIRC, the thing that was taking so long was merging into a
> single segment. Is there already benchmarking code for HNSW
> available somewhere? I feel like I remember someone posting benchmarking
> results on one of the Jira tickets.
>
> Thanks,
> Alex
>
> On Wed, May 26, 2021 at 3:41 PM Michael Sokolov wrote:
>
> > This Java implementation will be slower than the C implementation. I
> > believe the algorithm is essentially the same, however this is new and
> > there may be bugs! I (and I think Julie had similar results IIRC)
> > measured something like 8x slower than hnswlib (using ann-benchmarks).
> > It is also surprising (to me) though how this varies with
> > differently-learned vectors so YMMV. I still think there is value
> > here, and look forward to improved performance, especially as JDK16
> > has some improved support for vectorized instructions.
> >
> > Please also understand that the HNSW algorithm interacts with Lucene's
> > segmented architecture in a tricky way. Because we build a graph
> > *per-segment* when flushing/merging, these graphs must be rebuilt whenever
> > segments are merged. So your indexing performance can be heavily
> > influenced by how often you flush, as well as by your merge policy
> > settings. Also, when searching, there is a bigger than usual benefit
> > for searching across fewer segments, since the cost of searching an
> > HNSW graph scales more or less with log N (so searching a single large
> > graph is cheaper than searching the same documents divided among
> > smaller graphs). So I do recommend using a multithreaded collector in
> > order to get the best latency with HNSW-based search. To get the best
> > indexing and searching performance, you should generally index as
> > large a number of documents as possible before flushing.
> >
> > -Mike
> >
> > On Wed, May 26, 2021 at 9:43 AM Michael Wechner wrote:
> > >
> > > Hi Alex
> > >
> > > Thank you very much for your feedback and the various insights!
> > >
> > > On 26.05.21 at 04:41, Alex K wrote:
> > > > Hi Michael and others,
> > > >
> > > > Sorry just now getting back to you. For your three original questions:
> > > >
> > > > - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> > > > thorough response.
> > > > - As far as I know Opendistro is calling out to a C/C++ binary to run the
> > > > actual HNSW algorithm and store the HNSW part of the index. When they
> > > > implemented it about a year ago, Lucene did not have this yet. I assume the
> > > > Lucene HNSW implementation is solid, but would not be surprised if it's
> > > > slower than the C/C++ based implementation, given the JVM has some
> > > > disadvantages for these kinds of CPU-bound/number crunching algos.
> > > > - I just haven't had much time to invest into my benchmark recently. In
> > > > particular, I got stuck on why indexing was taking extremely long. Just
> > > > indexing the vectors would have easily exceeded the current time
> > > > limitations in the ANN-benchmarks project. Maybe I had some naive mistake
> > > > in my implementation, but I profiled and dug pretty deep to make it fast.
> > >
> > > I am trying to get Julie's branch running
> > >
> > > https://github.com/jtibshirani/lucene/tree/hnsw-bench
> > >
> > > Maybe this will help and is comparable
> > >
> > >
> > > >
> > > > I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
> > >
> > > Yes, for more simple setups I would like to use Lucene standalone, but
> > > for setups which have to scale I would use either Elasticsearch or Solr.
> > >
> > > Thanks
> > >
> > > Michael
> > >
> > >
> > >
> > > > If so, another option you might try for ANN is the elastiknn-models
> > > > and elastiknn-lucene packages. elastiknn-models contains the Locality
> > > > Sensitive Hashing implementations of ANN used by Elastiknn, and
> > > > elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
> > > > query is the MatchHashesAndScoreQuery
> > > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
> > > > There are a couple of Scala test suites that show how to use it:
> > > > MatchHashesAndScoreQuerySuite
> > > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
> > > > MatchHashesAndScoreQueryPerformanceSuite
> > > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
> > > > This is all designed to work independently from Elasticsearch and is
> > > > published on Maven: com.klibisz.elastiknn / lucene
> > > > <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
> > > > and com.klibisz.elastiknn / models
> > > > <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
> > > > The tests are Scala but all of the implementation is in Java.
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> >
>
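
Mike's point above about segment count can be made concrete with a rough back-of-the-envelope sketch. It assumes, purely for illustration, that searching one HNSW graph of n vectors costs about log2(n) units and that a query visits every segment's graph. The function names and the cost model are hypothetical; they are not Lucene APIs and not measured numbers.

```python
import math


def graph_search_cost(num_vectors: int) -> float:
    """Assumed cost model: searching one HNSW graph costs ~log2(n) units."""
    return math.log2(num_vectors) if num_vectors > 1 else 1.0


def index_search_cost(total_vectors: int, num_segments: int) -> float:
    """Total cost when the vectors are split evenly across segments.

    Each segment holds its own graph, and a query searches all of them.
    """
    per_segment = total_vectors // num_segments
    return num_segments * graph_search_cost(per_segment)


if __name__ == "__main__":
    total = 1_000_000
    # Compare one large graph against the same vectors split into
    # 10 and 100 smaller per-segment graphs.
    for segments in (1, 10, 100):
        cost = index_search_cost(total, segments)
        print(f"{segments:>3} segment(s): ~{cost:.1f} cost units")
```

Under this model the total work grows almost linearly with the number of segments, because each smaller graph is only slightly cheaper to search than the big one. Searching segments in parallel with a multithreaded collector can hide that in latency, but not in total CPU work.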
