msokolov commented on a change in pull request #656:
URL: https://github.com/apache/lucene/pull/656#discussion_r803668310
########## File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java ##########

@@ -76,17 +81,23 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   @Override
   public Query rewrite(IndexReader reader) throws IOException {
-    BitSet[] bitSets = null;
+    TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
+    BitSetCollector filterCollector = null;
     if (filter != null) {
+      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
-      bitSets = new BitSet[reader.leaves().size()];
-      indexSearcher.search(filter, new BitSetCollector(bitSets));
+      indexSearcher.search(filter, filterCollector);

Review comment:
   For another day, but I am realizing that we have no opportunity to make use of per-segment concurrency here, as we ordinarily do in `IndexSearcher.search()`. To do so we'd need to consider some API change, though. Perhaps instead of using `rewrite` for this, we could make use of `Query`'s two-phase iteration mode of operation. Just a thought for later - I'll go open an issue elsewhere.
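[Editorial illustration of the per-segment concurrency idea above: a minimal, self-contained sketch using plain `java.util.concurrent`, not Lucene's API. All names here (`PerLeafSearch`, `searchLeaf`, the placeholder "search" logic) are invented stand-ins; the real change would need an API that hands the per-leaf kNN searches to `IndexSearcher`'s executor.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: run each "leaf" (segment) search as an independent task, the way
// IndexSearcher.search() parallelizes across leaf slices, then collect
// results in leaf order (mirroring perLeafResults[ctx.ord]).
public class PerLeafSearch {
  // Hypothetical stand-in for a per-leaf search; returns the "best" doc id.
  static int searchLeaf(int[] leafDocs) {
    int best = leafDocs[0];
    for (int d : leafDocs) {
      best = Math.min(best, d); // placeholder scoring: smallest id wins
    }
    return best;
  }

  public static int[] searchAllLeaves(List<int[]> leaves) {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (int[] leaf : leaves) {
        futures.add(executor.submit(() -> searchLeaf(leaf)));
      }
      int[] results = new int[futures.size()];
      for (int i = 0; i < results.length; i++) {
        results[i] = futures.get(i).get(); // blocks; preserves leaf order
      }
      return results;
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      executor.shutdown();
    }
  }
}
```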
########## File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java ##########

@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }

-  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, Bits bitsFilter)
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf, BitSetCollector filterCollector)
       throws IOException {
-    // If the filter is non-null, then it already handles live docs
-    if (bitsFilter == null) {
-      bitsFilter = ctx.reader().getLiveDocs();
+
+    if (filterCollector == null) {
+      Bits acceptDocs = ctx.reader().getLiveDocs();
+      return ctx.reader()
+          .searchNearestVectors(field, target, kPerLeaf, acceptDocs, Integer.MAX_VALUE);
+    } else {
+      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+      if (filterIterator == null || filterIterator.cost() == 0) {
+        return NO_RESULTS;
+      }
+
+      if (filterIterator.cost() <= k) {
+        // If there <= k possible matches, short-circuit and perform exact search, since HNSW must
+        // always visit at least k documents
+        return exactSearch(ctx, target, k, filterIterator);
+      }
+
+      try {
+        // The filter iterator already incorporates live docs
+        Bits acceptDocs = filterIterator.getBitSet();
+        int visitedLimit = (int) filterIterator.cost();
+        return ctx.reader().searchNearestVectors(field, target, kPerLeaf, acceptDocs, visitedLimit);
+      } catch (
+          @SuppressWarnings("unused")
+          CollectionTerminatedException e) {
+        // We stopped the kNN search because it visited too many nodes, so fall back to exact search
+        return exactSearch(ctx, target, k, filterIterator);
+      }
     }
+  }

-    TopDocs results = ctx.reader().searchNearestVectors(field, target, kPerLeaf, bitsFilter);
-    if (results == null) {
+  private TopDocs exactSearch(
+      LeafReaderContext context, float[] target, int k, DocIdSetIterator acceptIterator)
+      throws IOException {
+    FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+    if (fi == null || fi.getVectorDimension() == 0) {
+      // The field does not exist or does not index vectors
       return NO_RESULTS;
     }
-    if (ctx.docBase > 0) {
-      for (ScoreDoc scoreDoc : results.scoreDocs) {
-        scoreDoc.doc += ctx.docBase;
-      }
+
+    VectorSimilarityFunction similarityFunction = fi.getVectorSimilarityFunction();
+    VectorValues vectorValues = context.reader().getVectorValues(field);
+
+    HitQueue queue = new HitQueue(k, false);

Review comment:
   Did you consider using the pre-populated version? We might be creating and discarding a lot of `ScoreDoc`s here.

########## File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java ##########

@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws IOException {
   [...]
+      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
+      if (filterIterator == null || filterIterator.cost() == 0) {

Review comment:
   At first I was worried about treating `0` as a hard filter, since `cost` is generally only a heuristic, but in this case the `filterCollector` was produced through exhaustive iteration, so its `cost` is exact and this is OK.
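[Editorial illustration of the pre-populated-queue suggestion above: Lucene's `HitQueue` has a pre-populate mode that fills the queue with sentinel entries so the hot loop can overwrite the worst entry in place instead of allocating a `ScoreDoc` per candidate. Here is a self-contained sketch of the same allocation-avoiding idea using `java.util.PriorityQueue`; `Hit` and `TopK` are invented stand-ins, not Lucene classes.]

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch: keep exactly k Hit objects alive and reuse the evicted one,
// rather than allocating a new Hit for every candidate and letting
// insertWithOverflow discard most of them.
public class TopK {
  static final class Hit {
    int doc;
    float score;
    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /** Returns the doc ids of the k best scores, best first; scores[i] is doc i's score. */
  public static int[] topK(float[] scores, int k) {
    // Min-heap on score: the root is always the current worst of the top k.
    PriorityQueue<Hit> queue = new PriorityQueue<>(Comparator.comparingDouble(h -> h.score));
    for (int doc = 0; doc < scores.length; doc++) {
      if (queue.size() < k) {
        queue.add(new Hit(doc, scores[doc])); // only k allocations total
      } else if (scores[doc] > queue.peek().score) {
        Hit worst = queue.poll(); // reuse the evicted object in place of a new one
        worst.doc = doc;
        worst.score = scores[doc];
        queue.add(worst);
      }
    }
    int[] result = new int[queue.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = queue.poll().doc; // pops worst-first, so fill back-to-front
    }
    return result;
  }
}
```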
########## File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java ##########

@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws IOException {
   [...]
+      try {
+        // The filter iterator already incorporates live docs
+        Bits acceptDocs = filterIterator.getBitSet();
+        int visitedLimit = (int) filterIterator.cost();
+        return ctx.reader().searchNearestVectors(field, target, kPerLeaf, acceptDocs, visitedLimit);

Review comment:
   Oh, I see - annoyingly, `searchNearestVectors` can return `null`, and that is endemic in `CodecReader` :) We could add a helper function here, `emptyIfNull()` or so, but it's purely cosmetic, not really needed.
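[Editorial illustration of the `emptyIfNull()` helper suggested above, with an `int[]` standing in for `TopDocs` so the sketch is self-contained; the real version would take and return `org.apache.lucene.search.TopDocs` and its `NO_RESULTS` sentinel.]

```java
// Sketch: normalize a nullable search result to a shared empty sentinel,
// so callers can drop their null checks.
public class Results {
  static final int[] NO_RESULTS = new int[0]; // stand-in for the NO_RESULTS TopDocs

  static int[] emptyIfNull(int[] results) {
    return results == null ? NO_RESULTS : results;
  }
}
```

Wrapping each `searchNearestVectors` call site in this helper would confine the null handling to one place.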
########## File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java ##########

@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws IOException {
   [...]
+  private TopDocs exactSearch(
+      LeafReaderContext context, float[] target, int k, DocIdSetIterator acceptIterator)
+      throws IOException {
+    FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+    if (fi == null || fi.getVectorDimension() == 0) {
+      // The field does not exist or does not index vectors
       return NO_RESULTS;
     }
-    if (ctx.docBase > 0) {
-      for (ScoreDoc scoreDoc : results.scoreDocs) {
-        scoreDoc.doc += ctx.docBase;
-      }
+
+    VectorSimilarityFunction similarityFunction = fi.getVectorSimilarityFunction();
+    VectorValues vectorValues = context.reader().getVectorValues(field);
+
+    HitQueue queue = new HitQueue(k, false);
+    DocIdSetIterator iterator =
+        ConjunctionUtils.intersectIterators(List.of(acceptIterator, vectorValues));
+
+    int doc;
+    while ((doc = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      float[] vector = vectorValues.vectorValue();
+      float score = similarityFunction.convertToScore(similarityFunction.compare(vector, target));
+      queue.insertWithOverflow(new ScoreDoc(doc, score));
     }
-    return results;
+    ScoreDoc[] topScoreDocs = new ScoreDoc[queue.size()];
+    for (int i = topScoreDocs.length - 1; i >= 0; i--) {
+      topScoreDocs[i] = queue.pop();
+    }
+
+    TotalHits totalHits = new TotalHits(acceptIterator.cost(), TotalHits.Relation.EQUAL_TO);

Review comment:
   Right, we can report `EQUAL_TO`, since we exhaustively iterated.

########## File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java ##########

@@ -96,43 +107,98 @@ public Query rewrite(IndexReader reader) throws IOException {
   [...]
+      try {
+        // The filter iterator already incorporates live docs
+        Bits acceptDocs = filterIterator.getBitSet();
+        int visitedLimit = (int) filterIterator.cost();
+        return ctx.reader().searchNearestVectors(field, target, kPerLeaf, acceptDocs, visitedLimit);
+      } catch (
+          @SuppressWarnings("unused")
+          CollectionTerminatedException e) {

Review comment:
   I could go either way with this one. I tend to lean towards using `TopDocs.totalHits.value`, since we already use it for returning visited counts; we could return a null or maybe empty `scoreDocs` in that case. Or perhaps there could be a use case for returning the "best effort" results obtained by visiting a limited subset of the graph, and we should in fact marshal up the results. Generally I don't favor using exceptions for expected behavior, but if we do choose this pattern, I think we should create a new exception type just for this case.
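[Editorial illustration of the "new exception type" idea above: a hypothetical shape for a dedicated exception that also carries the best-effort hits gathered before the visit limit was reached, so the caller can choose between falling back to exact search and keeping the partial results. All names here are invented, not Lucene API, and an `int[]` of doc ids stands in for `TopDocs`.]

```java
// Sketch of a dedicated, payload-carrying exception for the visit-limit case.
public class VisitLimit {
  static final class VisitedLimitExceededException extends RuntimeException {
    final int[] partialHits;

    VisitedLimitExceededException(int[] partialHits) {
      // No message, no stack trace: this is pure control flow, kept cheap.
      super(null, null, false, false);
      this.partialHits = partialHits;
    }
  }

  // Stand-in for the approximate graph search: gives up once it has "visited"
  // visitedLimit docs, throwing with whatever it found so far.
  static int[] approximateSearch(int[] docs, int visitedLimit) {
    if (docs.length > visitedLimit) {
      throw new VisitedLimitExceededException(java.util.Arrays.copyOf(docs, visitedLimit));
    }
    return docs;
  }

  // Caller decides the policy: here we keep the "best effort" partial results,
  // where KnnVectorQuery instead falls back to exactSearch.
  public static int[] searchOrPartial(int[] docs, int visitedLimit) {
    try {
      return approximateSearch(docs, visitedLimit);
    } catch (VisitedLimitExceededException e) {
      return e.partialHits;
    }
  }
}
```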
########## File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java ##########

@@ -76,17 +81,23 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) {
   @Override
   public Query rewrite(IndexReader reader) throws IOException {
-    BitSet[] bitSets = null;
+    TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
+    BitSetCollector filterCollector = null;
     if (filter != null) {
+      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
-      bitSets = new BitSet[reader.leaves().size()];
-      indexSearcher.search(filter, new BitSetCollector(bitSets));
+      indexSearcher.search(filter, filterCollector);
     }
-    TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
     for (LeafReaderContext ctx : reader.leaves()) {
-      perLeafResults[ctx.ord] = searchLeaf(ctx, k, bitSets != null ? bitSets[ctx.ord] : null);
+      TopDocs results = searchLeaf(ctx, k, filterCollector);
+      if (results != null && ctx.docBase > 0) {
+        for (ScoreDoc scoreDoc : results.scoreDocs) {
+          scoreDoc.doc += ctx.docBase;
+        }
+      }
+      perLeafResults[ctx.ord] = results != null ? results : NO_RESULTS;

Review comment:
   Can `searchLeaf` actually return `null`? Should we make it return `NO_RESULTS` then?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org