[ https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539350#comment-17539350 ]
Kaival Parikh commented on LUCENE-10559: ---------------------------------------- We can maintain a *FixedBitSet* to record passing docs, and randomly *set* bits based on a *selectivity* argument, so over a large set of docs we would have a close proportion of passing docs We can use this bitset both while finding brute force NN in *computeNN* and HNSW kNN in {*}testSearch{*}. To incorporate pre-filtering, we can take an additional argument *prefilter* which wraps our bitset into a query, and passes it to *doKnnVectorQuery* to give pre-filtered results For post-filtering, we can request *topK/selectivity* number of results, to get close to *topK* after filtering out non passing docs I ran some tests with these changes (topK = 100, numDocs = 1M (wikimedia dataset), numIter = 1K): ||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time|| |0.8|125|0.974|1.67|0.987|1.71| |0.6|166|0.965|2.06|0.989|2.09| |0.4|250|0.967|2.85|0.991|2.83| |0.2|500|0.941|5.06|0.996|4.85| |0.1|1000|0.972|8.81|0.997|7.99| |0.01|10000|0.953|59.37|1.000|9.76| Note: I have directly updated the bitset in *BitSetCollector* to work around the arbitrary time it takes to *collect* results by iterating over the entire bitset (as it would be different for different types of queries, and {_}has to be considered separately{_}) The changes are available [here|https://github.com/kaivalnp/lucene/commit/c985fad694dfff4e9341649228b3a08d64c98a7f] in a separate branch. Looking for suggestions/comments on this approach > Add preFilter/postFilter options to KnnGraphTester > -------------------------------------------------- > > Key: LUCENE-10559 > URL: https://issues.apache.org/jira/browse/LUCENE-10559 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael Sokolov > Priority: Major > > We want to be able to test the efficacy of pre-filtering in KnnVectorQuery: > if you (say) want the top K nearest neighbors subject to a constraint Q, are > you better off over-selecting (say 2K) top hits and *then* filtering > (post-filtering), or incorporating the filtering into the query > (pre-filtering). How does it depend on the selectivity of the filter? > I think we can get a reasonable testbed by generating a uniform random filter > with some selectivity (that is consistent and repeatable). Possibly we'd also > want to try filters that are correlated with index order, but it seems they'd > be unlikely to be correlated with vector values in a way that the graph > structure would notice, so random is a pretty good starting point for this. -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org