Hi Russ

I would like to use it for detecting duplicate questions. I am currently using the sbert.net project you mention below to compute the embeddings, with a size of 768, for indexing and querying.

sbert has an example listed using "util.pytorch_cos_sim(A,B)" as a brute-force approach

https://sbert.net/docs/usage/semantic_textual_similarity.html
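
To make the brute-force idea concrete, here is a minimal sketch in plain Python. It is not the sbert API: toy 3-dimensional vectors stand in for the 768-dimensional embeddings, and the names cosine_similarity and find_duplicates, as well as the 0.9 threshold, are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_duplicates(embeddings, threshold=0.9):
    """Score every pair (O(n^2)) and keep those at or above the threshold.

    The threshold is a hypothetical value; it would need tuning on real data.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(embeddings), 2):
        score = cosine_similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors standing in for real sentence embeddings (assumption).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
duplicates = find_duplicates(toy_embeddings, threshold=0.9)
for i, j, score in duplicates:
    print(f"questions {i} and {j} look like duplicates (score {score:.4f})")
```

Because every pair is scored, the cost grows quadratically with the number of questions, which is why an ANN index becomes interesting for larger collections.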

and a "paraphrase mining" approach for larger document collections

https://sbert.net/examples/applications/paraphrase-mining/README.html

Re the Lucene ANN implementation(s): I think it would be very interesting to participate in the ANN benchmarking challenge, which Julie mentioned on the dev list

http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCAKDq9%3D4rSuuczoK%2BcVg_N6Lwvh42E%2BXUoSGQ6m7BgqzuDvACew%40mail.gmail.com%3E

https://medium.com/big-ann-benchmarks/neurips-2021-announcement-the-billion-scale-approximate-nearest-neighbor-search-challenge-72858f768f69

Thanks

Michael



On 24.05.21 at 05:31, Russell Jurney wrote:
For practical search using BERT on any reasonably sized dataset, they're
going to need ANN, which Lucene recently added. This won't work in practice
if the query and document are of a different size, which is where sentence
transformers see a lot of use for documents up to 500 words.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9004

https://github.com/UKPLab/sentence-transformers

Russ

On Sun, May 23, 2021 at 8:23 PM Michael Sokolov <msoko...@gmail.com> wrote:

Hi Michael, that is fully-functional in the sense that Lucene will
build an HNSW graph for a vector-valued field and you can then use the
VectorReader.search method to do KNN-based search. Next steps may
include some integration with lexical, inverted-index type search so
that you can retrieve the N closest matches subject to other constraints.
Today you can approximate that by oversampling and filtering. There is
also interest in pursuing other KNN search algorithms, and we have
been working to make sure the VectorFormat API (might still get
renamed due to confusion with other kinds of vectors existing in
Lucene) can support alternative KNN implementations.
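
The oversample-and-filter workaround described above can be sketched as follows. This is only an illustration in plain Python, not Lucene code: an exact nearest-neighbour scan stands in for the HNSW search, and the names knn, filtered_knn, accept, and oversample are assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(query, vectors, k):
    """Exact top-k scan; a stand-in for an approximate (HNSW) index lookup."""
    scored = sorted((squared_distance(query, v), doc_id)
                    for doc_id, v in vectors.items())
    return [doc_id for _, doc_id in scored[:k]]


def filtered_knn(query, vectors, k, accept, oversample=4):
    """Fetch k * oversample neighbours, drop rejected ones, keep the first k.

    May return fewer than k hits if too many candidates are filtered out.
    """
    candidates = knn(query, vectors, k * oversample)
    return [doc_id for doc_id in candidates if accept(doc_id)][:k]


# Hypothetical toy data: 2-dimensional vectors and a category per document.
vectors = {
    "q1": [1.0, 0.0],
    "q2": [0.9, 0.1],
    "q3": [0.8, 0.2],
    "q4": [0.0, 1.0],
}
categories = {"q1": "billing", "q2": "support", "q3": "billing", "q4": "billing"}


def is_billing(doc_id):
    return categories[doc_id] == "billing"


query = [1.0, 0.0]
enough = filtered_knn(query, vectors, k=2, accept=is_billing, oversample=2)
too_few = filtered_knn(query, vectors, k=2, accept=is_billing, oversample=1)
print(enough, too_few)
```

With oversample=1 the filter removes one of the two candidates and only one hit survives, which is the weakness of this approach: the oversampling factor has to be guessed from how selective the filter is.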

On Wed, May 19, 2021 at 12:22 PM Michael Wechner
<michael.wech...@wyona.com> wrote:
Hi Alex

Just to make sure I understand better what the additions are about

On 21.04.21 at 17:21, Alex K wrote:
There were a couple additions recently merged into lucene but not yet
released:
- A first-class vector codec
do you mean the classes inside


https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90
and in particular

Lucene90HnswVectorFormat.java  Lucene90HnswVectorReader.java
Lucene90HnswVectorWriter.java

?

- An implementation of HNSW for approximate nearest neighbor search
the HNSW implementation at


https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/util/hnsw
is similar to


https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch/
?
They are however available in the snapshot releases. I started on a small
project to get the HNSW implementation into the ann-benchmarks project, but
had to set it aside.
Is there still something missing? Or what would be the next steps?

Thanks

Michael


Here's the code:
https://github.com/alexklibisz/ann-benchmarks-lucene. There are some test
suites that index and search Glove vectors. My first impression was that
indexing seems surprisingly slow, but it's entirely possible I'm doing
something wrong.

On Wed, Apr 21, 2021 at 9:31 AM Michael Wechner <michael.wech...@wyona.com> wrote:

Hi

I recently found the following articles re Lucene/Solr and BERT


https://dmitry-kan.medium.com/neural-search-with-bert-and-solr-ea5ead060b28

https://medium.com/swlh/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
and would like to ask whether there might be more recent developments
within the Lucene/Solr community re BERT integration?

Also how these developments relate to

https://sbert.net/

?

Thanks very much for your insights!

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com

