For practical BERT-based search on any reasonably sized dataset, you're going to need approximate nearest neighbor (ANN) search, which Lucene recently added. This won't work well in practice if the query and the document are of different sizes, which is where sentence transformers see a lot of use, for documents up to about 500 words.
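
The quoted message below mentions that constrained nearest-neighbor search can be approximated today by oversampling and filtering. Here is a minimal sketch of that idea in plain Python, not Lucene code: the exact brute-force `knn` function is only a stand-in for an ANN call, and the names `knn`, `filtered_knn`, `accept`, `oversample`, and the toy `lang` attribute are all made up for illustration.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(query: Sequence[float],
        vectors: Dict[str, Sequence[float]],
        k: int) -> List[Tuple[str, float]]:
    """Stand-in for the ANN call: exact top-k by cosine similarity."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def filtered_knn(query: Sequence[float],
                 vectors: Dict[str, Sequence[float]],
                 accept: Callable[[str], bool],
                 k: int,
                 oversample: int = 4) -> List[Tuple[str, float]]:
    """Fetch k * oversample neighbors, then keep the first k that pass the filter."""
    candidates = knn(query, vectors, k * oversample)
    return [(doc_id, score) for doc_id, score in candidates if accept(doc_id)][:k]


# Toy data: 2-d vectors plus a hypothetical "lang" attribute per document.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.8, 0.2],
    "d": [0.0, 1.0],
    "e": [0.7, 0.3],
}
lang = {"a": "de", "b": "en", "c": "de", "d": "en", "e": "en"}


def is_english(doc_id: str) -> bool:
    return lang[doc_id] == "en"


hits = filtered_knn([1.0, 0.0], vectors, is_english, k=2, oversample=2)
print([doc_id for doc_id, _ in hits])
```

The weakness of this approach is that a selective filter can leave fewer than k hits: with `oversample=1` the same query returns only one English document, so the oversampling factor has to be tuned to how restrictive the filter is.
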
https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9004
https://github.com/UKPLab/sentence-transformers

Russ

On Sun, May 23, 2021 at 8:23 PM Michael Sokolov <msoko...@gmail.com> wrote:

> Hi Michael, that is fully-functional in the sense that Lucene will
> build an HNSW graph for a vector-valued field and you can then use the
> VectorReader.search method to do KNN-based search. Next steps may
> include some integration with lexical, inverted-index type search so
> that you can retrieve N-closest constrained by other constraints.
> Today you can approximate that by oversampling and filtering. There is
> also interest in pursuing other KNN search algorithms, and we have
> been working to make sure the VectorFormat API (might still get
> renamed due to confusion with other kinds of vectors existing in
> Lucene) can support alternative KNN implementations.
>
> On Wed, May 19, 2021 at 12:22 PM Michael Wechner
> <michael.wech...@wyona.com> wrote:
> >
> > Hi Alex
> >
> > Just to make sure I understand better what the additions are about
> >
> > On 21.04.21 at 17:21, Alex K wrote:
> > > There were a couple additions recently merged into lucene but not yet
> > > released:
> > > - A first-class vector codec
> >
> > do you mean the classes inside
> >
> > https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90
> >
> > and in particular
> >
> > Lucene90HnswVectorFormat.java Lucene90HnswVectorReader.java
> > Lucene90HnswVectorWriter.java
> >
> > ?
> >
> > > - An implementation of HNSW for approximate nearest neighbor search
> >
> > the HNSW implementation at
> >
> > https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/util/hnsw
> >
> > is similar to
> >
> > https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch/
> >
> > ?
> >
> > > They are however available in the snapshot releases. I started on a
> > > small project to get the HNSW implementation into the ann-benchmarks
> > > project, but had to set it aside.
> >
> > Is there still something missing? Or what would be the next steps?
> >
> > Thanks
> >
> > Michael
> >
> > > Here's the code:
> > > https://github.com/alexklibisz/ann-benchmarks-lucene. There are some
> > > test suites that index and search Glove vectors. My first impression
> > > was that indexing seems surprisingly slow, but it's entirely possible
> > > I'm doing something wrong.
> > >
> > > On Wed, Apr 21, 2021 at 9:31 AM Michael Wechner
> > > <michael.wech...@wyona.com> wrote:
> > >
> > >> Hi
> > >>
> > >> I recently found the following articles re Lucene/Solr and BERT
> > >>
> > >> https://dmitry-kan.medium.com/neural-search-with-bert-and-solr-ea5ead060b28
> > >>
> > >> https://medium.com/swlh/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
> > >>
> > >> and would like to ask whether there might be more recent developments
> > >> within the Lucene/Solr community re BERT integration?
> > >>
> > >> Also how these developments relate to
> > >>
> > >> https://sbert.net/
> > >>
> > >> ?
> > >>
> > >> Thanks very much for your insights!
> > >>
> > >> Michael
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com