[
https://issues.apache.org/jira/browse/LUCENE-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433565#comment-17433565
]
Julie Tibshirani commented on LUCENE-10191:
-------------------------------------------
This is helpful feedback. I'm also sensitive to the fact that the more
complexity we add to a format, the harder it is for BWC and for future
implementations.
Some background: I think supporting Euclidean distance is really important.
With certain datasets, similarity is measured in terms of Euclidean distance
(instead of cosine), and in these cases it's critical to use Euclidean to get
sensible results. Cosine similarity is less critical, since we could ask users
to normalize all vectors to unit length before indexing + searching, and use
dot product. Personally I think cosine is valuable (more details in
https://issues.apache.org/jira/browse/LUCENE-10146), but am very happy to
discuss trade-offs. In general, supporting different vector functions is low
complexity compared to the ANN data structure itself.
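To make the normalization point concrete, here is a minimal sketch showing that the dot product of unit-length vectors matches cosine similarity of the originals. The class and method names (UnitNormSketch, dot, normalize, cosine) are hypothetical and chosen only for illustration; this is not Lucene's API or implementation.

```java
// Illustrative sketch only; names are hypothetical, not Lucene code.
final class UnitNormSketch {

    // Plain dot product of two equal-length vectors.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Returns a copy of v scaled to unit length.
    static float[] normalize(float[] v) {
        float norm = (float) Math.sqrt(dot(v, v));
        if (norm == 0f) {
            throw new IllegalArgumentException("cannot normalize a zero vector");
        }
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    // Cosine similarity computed directly from the unnormalized vectors.
    static float cosine(float[] a, float[] b) {
        return (float) (dot(a, b) / Math.sqrt((double) dot(a, a) * dot(b, b)));
    }
}
```

If both the indexed vectors and the query are passed through normalize first, ranking by dot gives the same order as ranking by cosine on the raw vectors.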
{quote}Instead, slower functions needing different representation should really
be different codecs... And trying to support these functions the way it happens
now is wrong to do and will lead to hairballs.
{quote}
To check I understand the idea – are you suggesting a separate format per ANN
method, per similarity function?
> Optimize vector functions by precomputing magnitudes
> ----------------------------------------------------
>
> Key: LUCENE-10191
> URL: https://issues.apache.org/jira/browse/LUCENE-10191
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Minor
>
> Both Euclidean distance (L2 norm) and cosine similarity can be expressed in
> terms of dot product and vector magnitudes:
> * l2_norm(a, b) = ||a - b|| = sqrt(||a||^2 - 2(a . b) + ||b||^2)
> * cosine(a, b) = (a . b) / (||a|| ||b||)
> We could compute and store each vector's magnitude upfront while indexing,
> and compute the query vector's magnitude once per query. Then we'd calculate
> the distance using our (very optimized) dot product method, plus the
> precomputed values.
> This is an exploratory issue: I haven't tested this out yet, so I'm not sure
> how much it would help. I would at least expect it to help with cosine
> similarity – several months ago we tried out similar ideas in Elasticsearch
> and were able to get a nice boost in cosine performance.
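As a rough illustration of the two identities in the issue description, the sketch below computes both functions from a single dot product plus squared magnitudes that are assumed to have been computed ahead of time (once per document at index time, once per query). The class name PrecomputedMagnitudeSketch and its methods are hypothetical, and nothing here reflects how Lucene stores or reads vectors.

```java
// Illustrative sketch only; names are hypothetical, not Lucene code.
final class PrecomputedMagnitudeSketch {

    // Plain dot product of two equal-length vectors.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // ||v||^2, the value that would be stored alongside each vector.
    static float squaredMagnitude(float[] v) {
        return dot(v, v);
    }

    // ||a - b|| = sqrt(||a||^2 - 2(a . b) + ||b||^2)
    static float l2Distance(float[] a, float aSq, float[] b, float bSq) {
        float squared = aSq - 2f * dot(a, b) + bSq;
        // Rounding can push the value slightly below zero for near-identical vectors.
        return (float) Math.sqrt(Math.max(0f, squared));
    }

    // cosine(a, b) = (a . b) / (||a|| ||b||)
    static float cosine(float[] a, float aSq, float[] b, float bSq) {
        return (float) (dot(a, b) / Math.sqrt((double) aSq * bSq));
    }
}
```

Whether this is actually faster than the direct computation is the open question the issue raises. The sketch only shows that each comparison needs one dot product once the magnitudes are available.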
--
This message was sent by Atlassian Jira
(v8.3.4#803005)