jtibshirani commented on pull request #235: URL: https://github.com/apache/lucene/pull/235#issuecomment-896029536
> I noticed that scores for the default similarity (Euclidean) had very low precision as they got large... The way we were handling this was to apply an `exp(-distance)` to convert distances to scores.

I wonder if we could just swap in `f(x) = 1 / (1 + x)`, which decays a lot more slowly than `exp(-x)`. This maintains the nice property of producing scores within [0, 1].

> There's a clever implementation (hack?!) to deal with trying to minimize over-collection across multiple segments. Basically the idea is to optimistically collect the expected proportion of top K based on the segment size (plus a margin)...

This is a nice idea! The binomial estimate rests on the assumption that the nearest vectors are randomly distributed across the index. But since segment membership is related to when a document was indexed, I wonder if it will be common for most of the nearest neighbors to land in a single segment. For example, maybe we are indexing (and embedding) news articles as they're written, and our query is a news event. Would it make sense to start with a simple approach where we just collect `k` from each segment, and then explore optimizations in a follow-up with benchmarks?
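
A minimal sketch (illustrative only, not code from this PR) comparing the two conversions. Both map a non-negative distance into (0, 1] and preserve ordering, but `exp(-x)` collapses toward zero far sooner than `1 / (1 + x)`:

```java
public class ScoreConversionDemo {

  // Current conversion discussed above: decays very quickly, so large
  // distances all map to tiny floats with little usable precision.
  static float expScore(float distance) {
    return (float) Math.exp(-distance);
  }

  // Proposed alternative: decays slowly while still staying within (0, 1].
  static float reciprocalScore(float distance) {
    return 1f / (1f + distance);
  }

  public static void main(String[] args) {
    for (float d : new float[] {0f, 1f, 10f, 100f, 1000f}) {
      System.out.printf("distance=%7.1f  exp(-x)=%.6e  1/(1+x)=%.6e%n",
          d, expScore(d), reciprocalScore(d));
    }
  }
}
```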
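
And a rough sketch of the simpler per-segment strategy (hypothetical helper and type names, not this PR's implementation): collect `k` hits from every segment, then merge them into a single global top-`k` by score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SimpleSegmentMerge {

  // Hypothetical per-segment hit: a global doc id plus its score in (0, 1].
  static final class Hit {
    final int docId;
    final float score;
    Hit(int docId, float score) {
      this.docId = docId;
      this.score = score;
    }
  }

  // Merge the per-segment top-k lists into one global top-k.
  static List<Hit> mergeTopK(List<List<Hit>> perSegmentTopK, int k) {
    // Min-heap by score, so the weakest of the current best k sits on top.
    PriorityQueue<Hit> best = new PriorityQueue<>(Comparator.comparingDouble(h -> h.score));
    for (List<Hit> segmentHits : perSegmentTopK) {
      for (Hit hit : segmentHits) {
        if (best.size() < k) {
          best.add(hit);
        } else if (hit.score > best.peek().score) {
          best.poll();
          best.add(hit);
        }
      }
    }
    // Return the surviving hits sorted from highest to lowest score.
    List<Hit> result = new ArrayList<>(best);
    result.sort(Comparator.<Hit>comparingDouble(h -> h.score).reversed());
    return result;
  }
}
```

This over-collects (segments × k candidates for k results), which is exactly the cost the proportional estimate tries to avoid, but it is easy to reason about and could serve as the baseline for later benchmarks.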
