Re: Experience re OpenAI embeddings in combination with Lucene vector search

Michael Wechner Tue, 15 Feb 2022 01:10:56 -0800

Fair enough, but wouldn't it make sense to be able to increase it programmatically, e.g.


.setVectorMaxDimension(2028)


?

Thanks

Michael


On 14.02.22 at 23:34, Michael Sokolov wrote:

I think we picked the 1024 number as something that seemed so large
nobody would ever want to exceed it! Obviously that was naive. Still
the limit serves as a cautionary point for users; if your vectors are
bigger than this, there is probably a better way to accomplish what
you are after (e.g. better offline training to reduce dimensionality).
Is 1024 the magic number? Maybe not, but before increasing I'd like to
see some strong evidence that bigger vectors than that are indeed
useful as part of a search application using Lucene.

-Mike

On Mon, Feb 14, 2022 at 5:08 PM Julie Tibshirani <juliet...@gmail.com> wrote:

Sounds good, hope the testing goes well! Memory and CPU (largely from more 
expensive vector distance calculations) are indeed the main factors to consider.

Julie

On Mon, Feb 14, 2022 at 1:02 PM Michael Wechner <michael.wech...@wyona.com> 
wrote:

Hi Julie

Thanks again for your feedback!

I will do some more tests with "all-mpnet-base-v2" (768) and 
"all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)

But yes, I could imagine that eventually it might make sense to allow more 
dimensions than 1024.

Besides memory and CPU, are there other limiting factors re more dimensions?

Thanks

Michael

On 14.02.22 at 21:53, Julie Tibshirani wrote:

Hello Michael, the max number of dimensions is currently hardcoded and can't be 
changed. I could see an argument for increasing the default a bit and would be 
happy to discuss if you'd like to file a JIRA issue. However, 12288 dimensions 
still seems high to me; this is much larger than most well-established 
embedding models and could require a lot of memory.
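
As a rough illustration of the memory point, here is a minimal sketch that estimates the raw storage for float vectors at the dimensions mentioned in this thread. It assumes 4-byte floats and a hypothetical corpus of one million vectors, and it ignores any index or graph overhead, so the real footprint would be higher. The class and method names are made up for illustration and are not part of Lucene.

```java
// Rough estimate only: raw float32 vector storage, ignoring index/graph overhead.
// VectorMemoryEstimate and rawVectorBytes are hypothetical names, not Lucene API.
final class VectorMemoryEstimate {

    /** Bytes needed to hold numVectors vectors of the given dimension as 4-byte floats. */
    static long rawVectorBytes(long numVectors, int dimensions) {
        return numVectors * dimensions * Float.BYTES;
    }

    public static void main(String[] args) {
        long numVectors = 1_000_000L; // assumed corpus size, for illustration only
        int[] dims = {768, 1024, 4096, 12288}; // dimensions mentioned in the thread
        for (int d : dims) {
            long bytes = rawVectorBytes(numVectors, d);
            System.out.printf("%5d dims: %.1f GiB%n", d, bytes / (1024.0 * 1024 * 1024));
        }
    }
}
```

Under these assumptions the raw storage grows linearly with the dimension, so 12288 dimensions need twelve times the space of 1024.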

Julie

On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner <michael.wech...@wyona.com> 
wrote:

Hi Julie

Thanks very much for this link, which is very interesting!

Btw, do you have an idea how to increase the default max size of 1024?

https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o

Thanks

Michael



On 14.02.22 at 17:45, Julie Tibshirani wrote:

Hello Michael, I don't have personal experience with these models, but I found 
this article insightful: 
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9
It evaluates the OpenAI models against a variety of existing models on tasks 
like sentence similarity and text retrieval. Although the other models are 
cheaper and have fewer dimensions, the OpenAI ones perform similarly or worse. 
This got me thinking that they might not be a good cost/effectiveness 
trade-off, especially the larger ones with 4096 or 12288 dimensions.

Julie

On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner <michael.wech...@wyona.com> 
wrote:

Regarding the OpenAI embeddings, the following recent paper might be of interest:

https://arxiv.org/pdf/2201.10005.pdf

(Text and Code Embeddings by Contrastive Pre-Training, Jan 24, 2022)

Thanks

Michael

On 13.02.22 at 00:14, Michael Wechner wrote:

Here is a concrete example where I combine the OpenAI model 
"text-similarity-ada-001" with Lucene vector search

INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
    score '0.98860765'

2) What was your age last year?
    score '0.97811764'

3) What is your age?
    score '0.97094905'

4) How old are you?
    score '0.9600177'


Result 1 is great. Result 2 looks similar, but is not correct from an 
"understanding" point of view. Results 3 and 4 are good again.

I understand "similarity" is not the same as "understanding", but I hope it 
makes it clearer what I am looking for :-)
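
To make the "similarity" part concrete, here is a minimal sketch of cosine similarity between two embedding vectors in plain Java. The thread does not say which similarity function produced the scores above, so cosine is an assumption here, and the class name is hypothetical. The point is only that such a measure compares vector directions and has no notion of whether "this year" and "last year" mean different things.

```java
// Plain-Java cosine similarity; an assumed stand-in, not the code used for the scores above.
// CosineSketch is a hypothetical name and not part of Lucene.
final class CosineSketch {

    /** Cosine of the angle between a and b; returns NaN if either vector is all zeros. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        // One pass over the vectors: the work grows linearly with the dimension.
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

Calling `CosineSketch.cosine(v1, v2)` on two sentence embeddings gives a value between -1 and 1. Because the loop touches every component once, each comparison costs more CPU as the number of dimensions grows.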

Thanks

Michael



On 12.02.22 at 22:38, Michael Wechner wrote:

Hi Alessandro

I am mainly interested in detecting similarity, for example whether the 
following two sentences are similar, i.e. likely to mean the same thing

"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e. do not mean the 
same thing

"How old are you this year?"
"How old have you been last year?"

I am also interested in performance and in how OpenAI embeddings compare, for 
example, with SBERT 
(https://sbert.net/docs/usage/semantic_textual_similarity.html)

Thanks

Michael



On 12.02.22 at 20:41, Alessandro Benedetti wrote:

Hi Michael, experience to what extent?
We have been exploring the area for a while given we contributed the first 
neural search milestone to Apache Solr.
What are you curious about? Performance? Relevance impact? How to integrate it?
Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner, <michael.wech...@wyona.com> wrote:

Hi

Does anyone have experience using OpenAI embeddings in combination with Lucene 
vector search?

https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

and

https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

?


Thanks

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


