[ 
https://issues.apache.org/jira/browse/SOLR-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024592#comment-18024592
 ] 

Ilan Ginzburg commented on SOLR-17319:
--------------------------------------

Ufuk: ??if I have a client application that queries a multi-shard collection in 
Solr using different (vector, keyword) methods and implements RRF logic itself, 
does it technically give the correct result???

Yes, but... To get the best 100 post RRF results with such a strategy, one 
needs to fetch more than 100 results from each mode (vector, keyword). A doc 
not in any of the two result sets top 100 can have a RRF score pushing it into 
the top 100 of the fusion.

Taking the most extreme example to make my point (in practice this can happen 
with more realistic scenarios): the first 99 docs of both keyword and vector 
result sets are identical (not necessarily in same order though), and therefore 
they occupy the top 99 slots of the post RRF result set. Assume 100 in the 
vector results for example does not appear in the keyword results at all. Its 
RRF score is therefore 1/(k+100).

There can be a document that is not in the top 100 of any of the two results 
set that can have a score higher than this doc at position 100.

The worst case is that such a doc is at position 101 in one of the result sets 
(keyword, vector), and in a lower position in the other result set.

If that doc is at position p in the other result set, it fusion score is 
1/(k+101) + 1/(k+p) (score at position 101 + score at position p). For this doc 
to make it into the top 100 post fusion, it needs to satisfy: 1/(k+101) + 
1/(k+p) > 1/(k+100), i.e. have a higher score than the current doc at position 
100 that appears only in one of the two result sets.

The highest p that satisfies this inequation (assuming k=60) is 25699. So to 
get a formally correct RRF to fetch the top 100 docs, one would need to fetch 
25699 keyword results and as many vector results. This is the worst case, 
moreover I assume there can be many optimisations for early termination, but 
you get the idea.

Computing the fusion per shard (and not doing the exact RRF calculation) has 
merit.

> Introduce support for Reciprocal Rank Fusion (combining queries)
> ----------------------------------------------------------------
>
>                 Key: SOLR-17319
>                 URL: https://issues.apache.org/jira/browse/SOLR-17319
>             Project: Solr
>          Issue Type: New Feature
>          Components: vector-search
>    Affects Versions: 9.6.1
>            Reporter: Alessandro Benedetti
>            Assignee: Alessandro Benedetti
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 28h 20m
>  Remaining Estimate: 0h
>
> Reciprocal Rank Fusion (RRF) is an algorithm that takes in input multiple 
> ranked lists to produce a unified result set. 
> Examples of use cases where RRF can be used include hybrid search and 
> multiple Knn vector queries executed concurrently. 
> RRF is based on the concept of reciprocal rank, which is the inverse of the 
> rank of a document in a ranked list of search results. 
> The combination of search results happens taking into account the position of
>  the items in the original rankings, and giving higher score to items that 
> are ranked higher in multiple lists. RRF was introduced the first time by 
> Cormack et al. in [1].
> The syntax proposed:
> JSON Request
> {code:json}
> {
>     "queries": {
>         "lexical1": {
>             "lucene": {
>                 "query": "id:(10^=2 OR 2^=1 OR 4^=0.5)"
>             }
>         },
>         "lexical2": {
>             "lucene": {
>                 "query": "id:(2^=2 OR 4^=1 OR 3^=0.5)"
>             }
>         }
>     },
>     "limit": 10,
>     "fields": "[id,score]",
>     "params": {
>         "combiner": true,
>         "combiner.upTo": 5,
>         "facet": true,
>         "facet.field": "id",
>         "facet.mincount": 1
>     }
> }
> {code}
> [1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet 
> and individual rank learning methods.” Proceedings of the 32nd international 
> ACM SIGIR conference on Research and development in information retrieval 
> (2009)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to