jtibshirani edited a comment on pull request #235:
URL: https://github.com/apache/lucene/pull/235#issuecomment-896029536


   > I noticed that scores for the default similarity (Euclidean) had very low
   > precision as they got large... The way we were handling this was to apply an
   > `exp(-distance)` to convert distances to scores.
   
   I wonder if we could just swap in `f(x) = 1 / (1 + x)`, which decays much
   more slowly than `exp(-x)` and so keeps large distances distinguishable in
   float precision. It maintains the nice property of producing scores within
   [0, 1].
   
   > There's a clever implementation (hack?!) to deal with trying to minimize
   > over-collection across multiple segments. Basically the idea is to
   > optimistically collect the expected proportion of top K based on the segment
   > size (plus a margin)...
   
   This is a nice idea! The binomial estimate assumes that the nearest vectors
   are randomly distributed through the index. But since segment membership is
   related to when a document was indexed, I wonder if it'll be common for most
   nearest neighbors to be found in one segment. For example, maybe we are
   indexing (and embedding) news articles as they're written, and our query is a
   news event. Would it make sense to start with a simple approach where we just
   collect `k` from each segment? Then we could explore optimizations in a
   follow-up with benchmarks.
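   
   To make the simple baseline concrete, something along these lines (a rough
   sketch of "collect `k` per segment, then merge"; the class, record, and method
   names are made up for illustration, not this PR's code):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.PriorityQueue;
   
   class SimplePerSegmentMerge {
     // docBase + segmentDoc would give the index-wide doc id when mapping back.
     record Hit(int docBase, int segmentDoc, float score) {}
   
     // Hypothetical sketch, not this PR's implementation: each segment
     // contributes its own top k, and a min-heap on score keeps the global
     // top k while merging.
     static List<Hit> mergeTopK(List<List<Hit>> perSegmentTopK, int k) {
       // Min-heap: the head is the weakest hit among the current global top k.
       PriorityQueue<Hit> queue =
           new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score()));
       for (List<Hit> segmentHits : perSegmentTopK) {
         for (Hit hit : segmentHits) {
           if (queue.size() < k) {
             queue.add(hit);
           } else if (hit.score() > queue.peek().score()) {
             queue.poll();
             queue.add(hit);
           }
         }
       }
       List<Hit> topK = new ArrayList<>(queue);
       topK.sort((a, b) -> Float.compare(b.score(), a.score()));
       return topK;
     }
   }
   ```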




