jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803965732
##########
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##########
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, BitSetCollector filterCollector)
       throws IOException {
-    // If the filter is non-null, then it already handles live docs
-    if (bitsFilter == null) {
-      bitsFilter = ctx.reader().getLiveDocs();
+
+    if (filterCollector == null) {
+      Bits acceptDocs = ctx.reader().getLiveDocs();
+      return ctx.reader()
+          .searchNearestVectors(field, target, kPerLeaf, acceptDocs, Integer.MAX_VALUE);
+    } else {
+      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+      if (filterIterator == null || filterIterator.cost() == 0) {
+        return NO_RESULTS;
+      }
+
+      if (filterIterator.cost() <= k) {
+        // If there <= k possible matches, short-circuit and perform exact search, since HNSW must
+        // always visit at least k documents
+        return exactSearch(ctx, target, k, filterIterator);
+      }
+
+      try {
+        // The filter iterator already incorporates live docs
+        Bits acceptDocs = filterIterator.getBitSet();
+        int visitedLimit = (int) filterIterator.cost();
+        return ctx.reader().searchNearestVectors(field, target, kPerLeaf, acceptDocs, visitedLimit);
+      } catch (
+          @SuppressWarnings("unused")
+          CollectionTerminatedException e) {

Review comment:
I agree, it's nice to avoid using exceptions for normal control flow. I'm not too concerned from a performance perspective, though: exceptions aren't thrown in a "hot loop" and I didn't see a perf hit in testing.

If we go the route of using `TopDocs`, I'd prefer to avoid `null`, since it's already overloaded (it indicates that the field is missing or does not have vectors). Brainstorming ideas:
* Just return `EMPTY_TOPDOCS`.
* Still return best score docs and the visited count.
  But use `EQUAL_TO` for `TotalHits.Relation` if the search completed normally, otherwise use `GREATER_THAN_OR_EQUAL_TO`.
* Use a special subtype of `TopDocs` instead, which has an explicit "complete" flag?
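
To make the second idea a bit more concrete, here is a rough, self-contained sketch of how the relation could signal whether the search hit the visited limit. Everything in it is hypothetical and for illustration only: `VisitLimitSketch`, `Result`, `search`, and `completedNormally` are made-up stand-ins rather than Lucene classes, the enum merely mirrors the two `TotalHits.Relation` values, and the linear scan is a toy substitute for the HNSW graph search.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: simplified stand-ins, NOT the Lucene classes or API.
class VisitLimitSketch {

  // Mirrors the two relation values mentioned in the comment.
  enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

  // Hypothetical result: visited count, its relation, and the best doc ids seen so far.
  static final class Result {
    final long visited;
    final Relation relation;
    final List<Integer> bestDocs;

    Result(long visited, Relation relation, List<Integer> bestDocs) {
      this.visited = visited;
      this.relation = relation;
      this.bestDocs = bestDocs;
    }

    // EQUAL_TO means the search ran to completion without hitting the limit.
    boolean completedNormally() {
      return relation == Relation.EQUAL_TO;
    }
  }

  // Toy search: scans candidates in order (a stand-in for graph traversal) and
  // stops early, without throwing, once visitedLimit candidates have been visited.
  static Result search(float[] scores, int k, int visitedLimit) {
    List<Integer> best = new ArrayList<>();
    long visited = 0;
    for (int doc = 0; doc < scores.length; doc++) {
      if (visited >= visitedLimit) {
        // Limit reached with candidates left: report partial results.
        return new Result(visited, Relation.GREATER_THAN_OR_EQUAL_TO, best);
      }
      visited++;
      best.add(doc);
      // Keep the k highest-scoring docs, best first.
      best.sort((a, b) -> Float.compare(scores[b], scores[a]));
      if (best.size() > k) {
        best.remove(best.size() - 1);
      }
    }
    return new Result(visited, Relation.EQUAL_TO, best);
  }

  public static void main(String[] args) {
    float[] scores = {0.1f, 0.9f, 0.5f, 0.7f};
    Result result = search(scores, 2, 2);
    if (!result.completedNormally()) {
      // This is where the caller could fall back to exact search.
      System.out.println("limit hit after " + result.visited + " visits; falling back");
    } else {
      System.out.println("top docs: " + result.bestDocs);
    }
  }
}
```

With this shape the caller checks the relation instead of catching `CollectionTerminatedException`, and `null` keeps its existing meaning.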