[ https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396283#comment-17396283 ]

Michael Sokolov commented on LUCENE-9614:
-----------------------------------------

Thinking about how to make the scores be commensurate across different indexes for the same query ... in the case of dot product there's no issue since we assume all vectors are unit length (otherwise the dot-product similarity makes no sense), scores are always between 0 and 1 and there is no need for inversion or normalization. For the Euclidean distance, because we invert the scores to negative in order to sort descending, we need some way to normalize to make them non-negative.

And -- it's not really clear at all how to control the range of scores from this query given the typical use case of a boolean query disjunctively combining "semantic" matches from HNSW with "keyword" matches from term queries. Ideally we'd return scores in a fixed range (0 - 1) and let the query writer control the balance between keyword and semantic queries with the boost.

Possibly for these L2-normed queries, we can use something like {{score(q, d) = 1 - |q - d| / (|q| + |d|)}}. Then as {{d -> 0}} or {{d -> ∞}}, the score approaches 0, and score = 1 when q = d.

> Implement KNN Query
> -------------------
>
>                 Key: LUCENE-9614
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9614
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now we have a vector index format, and one vector indexing/KNN search
> implementation, but the interface is low-level: you can search across a
> single segment only. We would like to expose a Query implementation.
> Initially, we want to support a usage where the KnnVectorQuery selects the
> k-nearest neighbors without regard to any other constraints, and these can
> then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing
> vector search, or a re-entrant search process that can yield further results.
> Because of the nature of knn search (all documents having any vector value
> match), it is more like a ranking than a filtering operation, and it doesn't
> really make sense to provide an iterator interface that can be merged in the
> usual way, in docid order, skipping ahead. It's not yet clear how to satisfy
> a query that is "k nearest neighbors satisfying some arbitrary Query", at
> least not without realizing a complete bitset for the Query. But this is for
> a later issue; *this* issue is just about performing the knn search in
> isolation, computing a set of (some given) K nearest neighbors, and providing
> an iterator over those.
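
The normalization floated in the comment, {{score(q, d) = 1 - |q - d| / (|q| + |d|)}}, can be sketched in plain Java as below. This is only an illustration of the formula, not Lucene code: the class name NormalizedL2Score and its methods are hypothetical, and returning 1 when both vectors are zero is an assumption, since the formula is undefined there. By the triangle inequality, |q - d| <= |q| + |d|, so the result stays within 0 and 1.

```java
/** Hypothetical helper, not part of Lucene: sketches the proposed normalized L2 score. */
final class NormalizedL2Score {

  /** Euclidean (L2) norm of a vector. */
  static double norm(float[] v) {
    double sum = 0;
    for (float x : v) {
      sum += (double) x * x;
    }
    return Math.sqrt(sum);
  }

  /** Computes score(q, d) = 1 - |q - d| / (|q| + |d|). */
  static double score(float[] q, float[] d) {
    if (q.length != d.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    double denominator = norm(q) + norm(d);
    if (denominator == 0) {
      // Assumption: both vectors are zero, so q = d; treat this as a perfect match.
      return 1.0;
    }
    double distSquared = 0;
    for (int i = 0; i < q.length; i++) {
      double diff = (double) q[i] - d[i];
      distSquared += diff * diff;
    }
    double s = 1.0 - Math.sqrt(distSquared) / denominator;
    // Clamp to guard against tiny negative values caused by floating-point rounding.
    return Math.max(0.0, s);
  }

  public static void main(String[] args) {
    float[] q = {1f, 0f};
    System.out.println(score(q, new float[] {1f, 0f}));    // q = d
    System.out.println(score(q, new float[] {0f, 1f}));    // orthogonal, same length
    System.out.println(score(q, new float[] {1000f, 0f})); // d much longer than q
  }
}
```

With scores confined to a fixed range like this, the balance between "semantic" and "keyword" clauses could be left to the boost, as the comment suggests.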