Re: Lucene/Solr and BERT

Alex K Wed, 26 May 2021 13:04:39 -0700

Thanks Michael. IIRC, the thing that was taking so long was merging into a
single segment. Is there already benchmarking code for HNSW
available somewhere? I feel like I remember someone posting benchmarking
results on one of the Jira tickets.


Thanks,
Alex

On Wed, May 26, 2021 at 3:41 PM Michael Sokolov <msoko...@gmail.com> wrote:

> This Java implementation will be slower than the C implementation. I
> believe the algorithm is essentially the same, however this is new and
> there may be bugs! I measured something like 8x slower than hnswlib
> (using ann-benchmarks), and I think Julie had similar results, IIRC.
> It is also surprising (to me) though how this varies with
> differently-learned vectors so YMMV. I still think there is value
> here, and look forward to improved performance, especially as JDK16
> has some improved support for vectorized instructions.
>
> Please also understand that the HNSW algorithm interacts with Lucene's
> segmented architecture in a tricky way. Because we build a graph
> *per-segment* when flushing/merging, the graphs must be rebuilt whenever
> segments are merged. So your indexing performance can be heavily
> influenced by how often you flush, as well as by your merge policy
> settings. Also, when searching, there is a bigger than usual benefit
> for searching across fewer segments, since the cost of searching an
> HNSW graph scales more or less with log N (so searching a single large
> graph is cheaper than searching the same documents divided among
> smaller graphs). So I do recommend using a multithreaded collector in
> order to get the best latency with HNSW-based search. To get the best
> indexing and searching performance, you should generally index as
> large a number of documents as possible before flushing.
>
> -Mike
>
> On Wed, May 26, 2021 at 9:43 AM Michael Wechner
> <michael.wech...@wyona.com> wrote:
> >
> > Hi Alex
> >
> > Thank you very much for your feedback and the various insights!
> >
> > Am 26.05.21 um 04:41 schrieb Alex K:
> > > Hi Michael and others,
> > >
> > > Sorry just now getting back to you. For your three original questions:
> > >
> > > - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> > > thorough response.
> > > - As far as I know Opendistro is calling out to a C/C++ binary to run the
> > > actual HNSW algorithm and store the HNSW part of the index. When they
> > > implemented it about a year ago, Lucene did not have this yet. I assume the
> > > Lucene HNSW implementation is solid, but would not be surprised if it's
> > > slower than the C/C++ based implementation, given the JVM has some
> > > disadvantages for these kinds of CPU-bound/number crunching algos.
> > > - I just haven't had much time to invest into my benchmark recently. In
> > > particular, I got stuck on why indexing was taking extremely long. Just
> > > indexing the vectors would have easily exceeded the current time
> > > limitations in the ANN-benchmarks project. Maybe I had some naive mistake
> > > in my implementation, but I profiled and dug pretty deep to make it fast.
> >
> > I am trying to get Julie's branch running
> >
> > https://github.com/jtibshirani/lucene/tree/hnsw-bench
> >
> > Maybe this will help and is comparable
> >
> >
> > >
> > > I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
> >
> > Yes, for simpler setups I would like to use Lucene standalone, but
> > for setups which have to scale I would use either Elasticsearch or Solr.
> >
> > Thanks
> >
> > Michael
> >
> >
> >
> > > If so, another option you might try for ANN is the elastiknn-models
> > > and elastiknn-lucene packages. elastiknn-models contains the Locality
> > > Sensitive Hashing implementations of ANN used by Elastiknn, and
> > > elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
> > > query is the MatchHashesAndScoreQuery
> > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
> > > There are a couple of Scala test suites that show how to use it:
> > > MatchHashesAndScoreQuerySuite
> > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
> > > MatchHashesAndScoreQueryPerformanceSuite
> > > <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
> > > This is all designed to work independently from Elasticsearch and is
> > > published on Maven: com.klibisz.elastiknn / lucene
> > > <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
> > > and
> > > com.klibisz.elastiknn / models
> > > <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
> > > The tests are Scala but all of the implementation is in Java.
> > >
> > > Thanks,
> > > Alex
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>

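Mike's point in the thread above, that searching one large HNSW graph is cheaper than searching the same documents split across many smaller graphs, can be illustrated with a toy cost model. This is only a rough sketch in plain Java with no Lucene dependency. The class name SegmentCostSketch, its method names, and the assumption that one graph search costs exactly log2(n) "units" are all hypothetical and chosen for illustration; they are not Lucene APIs and not benchmark results.

```java
public class SegmentCostSketch {

    // Assumed cost of searching one HNSW graph holding n vectors:
    // log2(n) abstract "units" (the thread only says "more or less log N").
    public static double graphCost(long n) {
        return n <= 1 ? 1.0 : Math.log(n) / Math.log(2);
    }

    // Total work when totalDocs are split evenly across `segments` graphs
    // and every graph is searched one after another.
    public static double totalCost(long totalDocs, int segments) {
        return segments * graphCost(totalDocs / segments);
    }

    // Idealized latency when all segments are searched concurrently.
    // Ignores thread scheduling and the cost of merging per-segment top-k results.
    public static double parallelLatency(long totalDocs, int segments) {
        return graphCost(totalDocs / segments);
    }

    public static void main(String[] args) {
        long totalDocs = 1_000_000L;
        for (int segments : new int[] {1, 10, 100}) {
            System.out.printf("segments=%d sequential=%.1f parallel=%.1f%n",
                    segments,
                    totalCost(totalDocs, segments),
                    parallelLatency(totalDocs, segments));
        }
    }
}
```

Under this model, one graph of 1,000,000 vectors costs about 20 units, while ten segments of 100,000 vectors cost about 166 units in total when searched sequentially. Each individual segment is only slightly cheaper than the single large graph, which is why searching segments concurrently mainly helps latency rather than total work, and why fewer, larger segments help both.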