jtibshirani commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803965732



##########
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##########
@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws 
IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits 
bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, 
BitSetCollector filterCollector)
       throws IOException {
-    // If the filter is non-null, then it already handles live docs
-    if (bitsFilter == null) {
-      bitsFilter = ctx.reader().getLiveDocs();
+
+    if (filterCollector == null) {
+      Bits acceptDocs = ctx.reader().getLiveDocs();
+      return ctx.reader()
+          .searchNearestVectors(field, target, kPerLeaf, acceptDocs, 
Integer.MAX_VALUE);
+    } else {
+      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+      if (filterIterator == null || filterIterator.cost() == 0) {
+        return NO_RESULTS;
+      }
+
+      if (filterIterator.cost() <= k) {
+        // If there <= k possible matches, short-circuit and perform exact 
search, since HNSW must
+        // always visit at least k documents
+        return exactSearch(ctx, target, k, filterIterator);
+      }
+
+      try {
+        // The filter iterator already incorporates live docs
+        Bits acceptDocs = filterIterator.getBitSet();
+        int visitedLimit = (int) filterIterator.cost();
+        return ctx.reader().searchNearestVectors(field, target, kPerLeaf, 
acceptDocs, visitedLimit);
+      } catch (
+          @SuppressWarnings("unused")
+          CollectionTerminatedException e) {

Review comment:
       I agree, it's nice to avoid using exceptions for normal control flow. 
I'm not too concerned from a performance perspective though, exceptions aren't 
thrown in a "hot loop" and I didn't see a perf hit in testing.
   
   If we go the route of using `TopDocs`, I'd prefer to avoid 'null' since 
that's a bit overloaded (indicates the field is missing or does not have 
vectors). Brainstorming ideas:
   * Just return `EMPTY_TOPDOCS`.
   * Still return best score docs and the visited count. But use `EQUAL_TO` for 
`TotalHits.Relation` if the search completed normally, otherwise use 
`GREATER_THAN_OR_EQUAL_TO`. 
   * Use a special subtype of `TopDocs` instead, which has an explicit 
"complete" flag?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to