Re: Lucene/Solr and BERT

Alex K Tue, 25 May 2021 19:42:11 -0700

Hi Michael and others,

Sorry, I'm only now getting back to you. To answer your three original questions:


- Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
thorough response.
- As far as I know Opendistro is calling out to a C/C++ binary to run the
actual HNSW algorithm and store the HNSW part of the index. When they
implemented it about a year ago, Lucene did not have this yet. I assume the
Lucene HNSW implementation is solid, but would not be surprised if it's
slower than the C/C++ based implementation, given the JVM has some
disadvantages for these kinds of CPU-bound/number crunching algos.
- I just haven't had much time to invest into my benchmark recently. In
particular, I got stuck on why indexing was taking extremely long. Just
indexing the vectors would have easily exceeded the current time
limitations in the ANN-benchmarks project. Maybe I had some naive mistake
in my implementation, but I profiled and dug pretty deep to make it fast.

I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
If so, another option you might try for ANN is the elastiknn-models
and elastiknn-lucene packages. elastiknn-models contains the Locality
Sensitive Hashing implementations of ANN used by Elastiknn, and
elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
query is the MatchHashesAndScoreQuery
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
There are a couple of Scala test suites that show how to use it:
MatchHashesAndScoreQuerySuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>
and MatchHashesAndScoreQueryPerformanceSuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
This is all designed to work independently from Elasticsearch and is
published on Maven: com.klibisz.elastiknn / lucene
<https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
and
com.klibisz.elastiknn / models
<https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
The tests are Scala but all of the implementation is in Java.
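
For anyone unfamiliar with the hash-then-score idea, here is a minimal
Python sketch of the general pattern. It is not Elastiknn's code or API:
I'm assuming random-hyperplane (sign) hashing for cosine similarity, and
every function and parameter name below is made up for illustration. A
linear scan over the stored hashes stands in for the index lookup that a
real Lucene query would do.

```python
import math
import random


def make_hyperplanes(dims, num_bits, seed=0):
    """Sample one random Gaussian hyperplane per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(num_bits)]


def hash_vector(vec, hyperplanes):
    """Return one bit per hyperplane: 1 if vec is on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, plane)) >= 0.0 else 0
        for plane in hyperplanes
    )


def cosine(a, b):
    """Exact cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_index(corpus, hyperplanes):
    """Hash every vector once at index time."""
    return [hash_vector(vec, hyperplanes) for vec in corpus]


def search(query, corpus, index, hyperplanes, candidates, k):
    """Match hashes to shortlist candidates, then score the shortlist exactly."""
    q_hash = hash_vector(query, hyperplanes)
    # Step 1: count matching hash bits for every indexed vector.
    matches = [
        (sum(q == d for q, d in zip(q_hash, d_hash)), doc_id)
        for doc_id, d_hash in enumerate(index)
    ]
    # Step 2: keep the documents with the most matching bits (ties by doc id).
    matches.sort(key=lambda m: (-m[0], m[1]))
    shortlist = [doc_id for _, doc_id in matches[:candidates]]
    # Step 3: re-rank only the shortlist with the exact similarity.
    scored = [(doc_id, cosine(query, corpus[doc_id])) for doc_id in shortlist]
    scored.sort(key=lambda s: (-s[1], s[0]))
    return scored[:k]
```

The candidates parameter is the knob here: a larger shortlist costs more
exact scoring but should miss fewer true neighbors.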

Thanks,
Alex

On Mon, May 24, 2021 at 3:06 AM Michael Wechner <[email protected]>
wrote:

> Hi Russ
>
> I would like to use it for detecting duplicated questions, whereas I am
> currently using the project sbert.net you mention below to do the
> embedding with a size of 768 for indexing and querying.
>
> sbert has an example listed using "util.pytorch_cos_sim(A,B)" as a
> brute-force approach
>
> https://sbert.net/docs/usage/semantic_textual_similarity.html
>
> and a "paraphrase mining" approach for larger document collections
>
> https://sbert.net/examples/applications/paraphrase-mining/README.html
>
> Re the Lucene ANN implementation(s) I think it would be very interesting
> to participate in the ANN benchmarking challenge which Julie mentioned
> on the dev list
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCAKDq9%3D4rSuuczoK%2BcVg_N6Lwvh42E%2BXUoSGQ6m7BgqzuDvACew%40mail.gmail.com%3E
>
>
> https://medium.com/big-ann-benchmarks/neurips-2021-announcement-the-billion-scale-approximate-nearest-neighbor-search-challenge-72858f768f69
>
> Thanks
>
> Michael
>
>
>
> Am 24.05.21 um 05:31 schrieb Russell Jurney:
> > For practical search using BERT on any reasonably sized dataset, they're
> > going to need ANN, which Lucene recently added. This won't work in
> > practice if the query and document are of a different size, which is
> > where sentence transformers see a lot of use for documents up to 500
> > words.
> >
> > https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9004
> >
> > https://github.com/UKPLab/sentence-transformers
> >
> > Russ
> >
> > On Sun, May 23, 2021 at 8:23 PM Michael Sokolov <[email protected]>
> > wrote:
> >
> >> Hi Michael, that is fully-functional in the sense that Lucene will
> >> build an HNSW graph for a vector-valued field and you can then use the
> >> VectorReader.search method to do KNN-based search. Next steps may
> >> include some integration with lexical, inverted-index type search so
> >> that you can retrieve N-closest constrained by other constraints.
> >> Today you can approximate that by oversampling and filtering. There is
> >> also interest in pursuing other KNN search algorithms, and we have
> >> been working to make sure the VectorFormat API (might still get
> >> renamed due to confusion with other kinds of vectors existing in
> >> Lucene) can support alternative KNN implementations.
> >>
> >> On Wed, May 19, 2021 at 12:22 PM Michael Wechner
> >> <[email protected]> wrote:
> >>> Hi Alex
> >>>
> >>> Just to make sure I understand better what the additions are about
> >>>
> >>> Am 21.04.21 um 17:21 schrieb Alex K:
> >>>> There were a couple additions recently merged into lucene but not yet
> >>>> released:
> >>>> - A first-class vector codec
> >>> do you mean the classes inside
> >>>
> >>>
> >>> https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90
> >>> and in particular
> >>>
> >>> Lucene90HnswVectorFormat.java  Lucene90HnswVectorReader.java
> >>> Lucene90HnswVectorWriter.java
> >>>
> >>> ?
> >>>
> >>>> - An implementation of HNSW for approximate nearest neighbor search
> >>> the HNSW implementation at
> >>>
> >>>
> >>> https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/util/hnsw
> >>> is similar to
> >>>
> >>>
> >>> https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch/
> >>> ?
> >>>> They are however available in the snapshot releases. I started on a
> >>>> small project to get the HNSW implementation into the ann-benchmarks
> >>>> project, but had to set it aside.
> >>> Is there still something missing? Or what would be the next steps?
> >>>
> >>> Thanks
> >>>
> >>> Michael
> >>>
> >>>
> >>>> Here's the code:
> >>>> https://github.com/alexklibisz/ann-benchmarks-lucene. There are some
> >>>> test suites that index and search GloVe vectors. My first impression
> >>>> was that indexing seems surprisingly slow, but it's entirely possible
> >>>> I'm doing something wrong.
> >>>>
> >>>> On Wed, Apr 21, 2021 at 9:31 AM Michael Wechner <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi
> >>>>>
> >>>>> I recently found the following articles re Lucene/Solr and BERT
> >>>>>
> >>>>>
> >>>>> https://dmitry-kan.medium.com/neural-search-with-bert-and-solr-ea5ead060b28
> >>>>>
> >>>>> https://medium.com/swlh/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
> >>>>> and would like to ask whether there might be more recent developments
> >>>>> within the Lucene/Solr community re BERT integration?
> >>>>>
> >>>>> Also how these developments relate to
> >>>>>
> >>>>> https://sbert.net/
> >>>>>
> >>>>> ?
> >>>>>
> >>>>> Thanks very much for your insights!
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: [email protected]
> >>>>> For additional commands, e-mail: [email protected]
> >>>>>
> >>>>>
> >>>
> >> --
> > Thanks,
> > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > [email protected] LI <http://linkedin.com/in/russelljurney> FB
> > <http://facebook.com/jurney> datasyndrome.com
> >
>
>
