[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539350#comment-17539350
 ] 

Kaival Parikh commented on LUCENE-10559:
----------------------------------------

We can maintain a *FixedBitSet* to record passing docs, and randomly *set* bits 
based on a *selectivity* argument, so over a large set of docs we would have a 
close proportion of passing docs

We can use this bitset both while finding brute force NN in *computeNN* and 
HNSW kNN in {*}testSearch{*}.

To incorporate pre-filtering, we can take an additional argument *prefilter* 
which wraps our bitset into a query, and passes it to *doKnnVectorQuery* to 
give pre-filtered results

For post-filtering, we can request *topK/selectivity* number of results, to get 
close to *topK* after filtering out non passing docs

I ran some tests with these changes (topK = 100, numDocs = 1M (wikimedia 
dataset), numIter = 1K):
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter 
recall||pre-filter time||
|0.8|125|0.974|1.67|0.987|1.71|
|0.6|166|0.965|2.06|0.989|2.09|
|0.4|250|0.967|2.85|0.991|2.83|
|0.2|500|0.941|5.06|0.996|4.85|
|0.1|1000|0.972|8.81|0.997|7.99|
|0.01|10000|0.953|59.37|1.000|9.76|

Note: I have directly updated the bitset in *BitSetCollector* to work around 
the arbitrary time it takes to *collect* results by iterating over the entire 
bitset (as it would be different for different types of queries, and {_}has to 
be considered separately{_})

The changes are available 
[here|https://github.com/kaivalnp/lucene/commit/c985fad694dfff4e9341649228b3a08d64c98a7f]
 in a separate branch. Looking for suggestions/comments on this approach

> Add preFilter/postFilter options to KnnGraphTester
> --------------------------------------------------
>
>                 Key: LUCENE-10559
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10559
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>
> We want to be able to test the efficacy of pre-filtering in KnnVectorQuery: 
> if you (say) want the top K nearest neighbors subject to a constraint Q, are 
> you better off over-selecting (say 2K) top hits and *then* filtering 
> (post-filtering), or incorporating the filtering into the query 
> (pre-filtering). How does it depend on the selectivity of the filter?
> I think we can get a reasonable testbed by generating a uniform random filter 
> with some selectivity (that is consistent and repeatable). Possibly we'd also 
> want to try filters that are correlated with index order, but it seems they'd 
> be unlikely to be correlated with vector values in a way that the graph 
> structure would notice, so random is a pretty good starting point for this.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to