Pulkitg64 opened a new issue, #12414: URL: https://github.com/apache/lucene/issues/12414
### Description **Context** In KNN Prefiltering we require a ```BitSet``` of docs which matched with prefilter query. This BitSet is used during HNSW Graph Search to check whether a node is accepted for matching or not. To create this BitSet we call [createBitSet](https://github.com/apache/lucene/blob/9ffc625b2ec43481e2cfba29828efbddf12e6852/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L161) function which creates a BitSet with following conditions: 1. When there are **no deletions** in the index and ```DocIdSetIterator``` is of ```BitSetIterator``` instance, then directly return the BitSet. 2. For other cases create a ```FilteredDocIdSetIterator``` first then create ```BitSet``` by iterating over docs from the ```FilteredDocIdSetIterator```. ``` private BitSet createBitSet(DocIdSetIterator iterator, Bits liveDocs, int maxDoc) throws IOException { if (liveDocs == null && iterator instanceof BitSetIterator bitSetIterator) { // If we already have a BitSet and no deletions, reuse the BitSet return bitSetIterator.getBitSet(); } else { // Create a new BitSet from matching and live docs FilteredDocIdSetIterator filterIterator = new FilteredDocIdSetIterator(iterator) { @Override protected boolean match(int doc) { return liveDocs == null || liveDocs.get(doc); } }; return BitSet.of(filterIterator, maxDoc); } } ``` **Problem** When the ```DocIdSetIterator``` is of ```BitSetIterator``` instance and there are **deletions** in the index, we create a FilteredDocIdSetIterator which is later used to create a BitSet. This is a time consuming process because to create the BitSet we iterate over all matching docs from the iterator. **Proposed Optimisation** For the above case, we can wrap matchedDocs and liveDocs under single Bits instance something like below which can be directly passed for HNSW search. So, now during HNSW Search when a node is explored, we will check if the doc is accepted or not by checking bits of both matchedDocs and liveDocs which is constant time operation. ``` public class KNNBitSet implements Bits{ Bits liveDocs; Bits matchedDocs; public KNNBitSet(Bits matchedDocs, Bits liveDocs) { this.matchedDocs = matchedDocs; this.liveDocs = liveDocs; } @Override public boolean get(int index) { if (liveDocs ==null) { return matchedDocs.get(index); } return matchedDocs.get(index) & liveDocs.get(index); } @Override public int length() { return 0; } } ``` **Advantage** This will reduce BitSet creation time for some cases (when there are deleted documents) because now we are not iterating over each docs to create the BitSet. **Drawback** There is one drawback in this approach, that we will not be able to give the exact cardinality of acceptedDocs and the number will be higher than exact number of acceptedDocs because now we are not considering liveDocs into the final BitSet. *Original*: Number of Matched Docs = 10, Number of Deleted Docs = 2 Cardinality of Accepted Docs = 10-2 = 8 *With Proposed Optimisation*: Number of Matched Docs = 10, Number of Deleted Docs = 2 Cardinality of Accepted Docs = 10 (This is an overestimation as we are not considering deleted docs in BitSet) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
