kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1806196196
Summary of new changes: 1. Refactor into a more appropriate query - Move away from `AbstractKnnVectorQuery` to take advantage of inherent independence of segment-level results - KNN queries need to execute the core logic in `#rewrite` because of an inter-dependence of segment-level results (that is, given N segment-level hits we cannot determine if they will appear in the index-level `topK` without knowing results from other segments). This leads to requirements of [custom concurrency](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L82-L88) for individual HNSW searches, which should ideally be parallel by default - We can move graph searches down to a more appropriate place (like `#scorer`) to take advantage of this 2. Return a lazy-loading iterator instead of a greedy exact search (thanks @jpountz!) - Introduce a `visitLimit` on the number of nodes to traverse before stopping graph search - deeming it "too expensive". Once this is exhausted, return a lazy-loading iterator on all vectors (functionally equivalent to an exact search) - Unlike KNN queries, which need to traverse all vectors to determine which ones are present in the `topK` best-scoring ones, a similarity-based vector search can independently determine if a vector is a result or not (based on whether its similarity with the query is above a `resultSimilarity`) - Making use of this behavior, we can prevent a greedy exact search for collecting all matching docs into a list on heap, and determine if a vector is a match inside a `FilteredDocIdSetIterator` - This has a huge benefit when the query will be one of the clauses of a `BooleanQuery` (so other clauses will filter out non-matching docs and this query will only compute similarity scores with already filtered vectors). In the worst case, this will consider all vectors (same as exact search) - We also have useful information from graph search - mainly which hit was evaluated, and which hit was collected. This information can be re-used from the iterator: if a hit has been traversed, it will either be added to the results, or discarded. If it is present in the results, we simply lookup the score, otherwise mark it as rejected - If a vector has not been traversed in graph search, we compute its similarity score (so each query - document pair will only compute similarity scores once) Please let me know if this approach makes sense? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org