[ 
https://issues.apache.org/jira/browse/SOLR-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018260#comment-18018260
 ] 

David Smiley commented on SOLR-17319:
-------------------------------------

Way 1:  You mention that Chris referenced this, but AFAICT he did so 
critically, not in favor.  Quoting him:
{quote}Isn't any approach that computes a "score" based on the RRF _per shard_ 
and _then_ merges the per-shard results to find the topN results by definition 
"wrong" according to the RRF formula? (or at least: the RRF formula as i 
understand it?)
{quote}
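A small worked example may make the quoted concern concrete.  This is only an illustrative sketch, not Solr code: the class name, the doc ids (d1..d3), the two-shard layout and k=60 are all assumptions chosen for the demonstration.  Both sub-queries rank the corpus d1, d2, d3; shard 1 holds d1 and d2, shard 2 holds only d3.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of per-shard vs. whole-corpus RRF; not Solr code. */
final class PerShardRrfPitfall {
  static final int K = 60; // RRF constant, assumed for this example

  /** RRF scores for the docs in the given ranked lists (each list is best-first). */
  static Map<String, Double> rrfScores(List<List<String>> rankedLists) {
    Map<String, Double> scores = new HashMap<>();
    for (List<String> list : rankedLists) {
      for (int i = 0; i < list.size(); i++) {
        scores.merge(list.get(i), 1.0 / (K + i + 1), Double::sum); // rank = i + 1
      }
    }
    return scores;
  }

  /** Doc ids ordered by descending score, ties broken by id. */
  static List<String> order(Map<String, Double> scores) {
    List<String> ids = new ArrayList<>(scores.keySet());
    ids.sort((a, b) -> {
      int c = Double.compare(scores.get(b), scores.get(a));
      return c != 0 ? c : a.compareTo(b);
    });
    return ids;
  }

  /** Ranks come from the whole-corpus list of each sub-query. */
  static List<String> globalRrf() {
    return order(rrfScores(List.of(
        List.of("d1", "d2", "d3"),
        List.of("d1", "d2", "d3"))));
  }

  /** "Way 1": each shard scores with its local ranks, then the scores are merged. */
  static List<String> perShardRrf() {
    Map<String, Double> merged = new HashMap<>();
    merged.putAll(rrfScores(List.of(List.of("d1", "d2"), List.of("d1", "d2")))); // shard 1
    merged.putAll(rrfScores(List.of(List.of("d3"), List.of("d3"))));             // shard 2
    return order(merged);
  }
}
```

With whole-corpus ranks the order is d1, d2, d3.  With per-shard ranks d3 is rank 1 on its own shard for both sub-queries, so it ties with d1 and lands ahead of d2, even though every sub-query ranks it last.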
I completely agree with Chris, and I affirmed equivalent statements in my last 
summary.  Only "Way 2" is correct.  I'm not sure I understand/agree with what 
you said about "Way 2" so I'd like to offer what I think it'd look like:

Way 2:  A new QueryComponent subclass or collaborating SearchComponent shall 
arrange to execute the sub-queries concurrently using distributed-search (thus 
across shards) to get, for each sub-query, a complete (whole corpus) ranked 
list of its top offset+rows docs.  It then shall merge and rank those lists 
according to RRF, and finally apply offset & rows to derive the correct 
DocSlice (page).
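
The merge-then-page step could be sketched roughly as follows.  This is a minimal sketch under stated assumptions, not a proposed implementation: RrfMergeSketch and fuse are hypothetical names, docs are identified by plain string ids, and each input list is assumed to already be the whole-corpus (cross-shard) ranking for one sub-query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Solr: fuses whole-corpus ranked lists with RRF. */
final class RrfMergeSketch {

  /**
   * @param rankedLists one best-first list of doc ids per sub-query,
   *                    each already merged across all shards
   * @param k           RRF constant (60 in Cormack et al.)
   * @param offset      number of fused docs to skip
   * @param rows        page size
   * @return the requested page of doc ids in fused order
   */
  static List<String> fuse(List<List<String>> rankedLists, int k, int offset, int rows) {
    Map<String, Double> scores = new HashMap<>();
    for (List<String> list : rankedLists) {
      for (int i = 0; i < list.size(); i++) {
        int rank = i + 1; // ranks are 1-based
        scores.merge(list.get(i), 1.0 / (k + rank), Double::sum);
      }
    }

    List<String> fused = new ArrayList<>(scores.keySet());
    // Descending RRF score; ties broken by id so the order is deterministic.
    fused.sort((a, b) -> {
      int c = Double.compare(scores.get(b), scores.get(a));
      return c != 0 ? c : a.compareTo(b);
    });

    // Paging happens only after fusion.
    int from = Math.min(offset, fused.size());
    int to = Math.min(from + rows, fused.size());
    return new ArrayList<>(fused.subList(from, to));
  }
}
```

For instance, fusing the rankings implied by the two queries in the issue description, [10, 2, 4] and [2, 4, 3], with k=60 gives 2, 4, 10, 3; offset=1 and rows=2 then selects 4 and 10.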

Details: 
 * I'm not sure if the SearchComponent distributed-search protocol/API can 
process sub-queries in parallel somehow.  It can do shards in parallel but not 
sure about N sub-queries in parallel.  It's a complicated under-documented 
protocol as well.  But certainly a component could use a ShardHandler's 
executor to independently do the requests, maybe using EmbeddedSolrServer to 
talk to the current core.
 * I recommend forcing distributed-search / shortCircuit=false somehow so that 
you basically have one distributed implementation to code/maintain/test instead 
of two, thus not doing a separate single-shard optimized code path.
 * Faceting or other queries requiring a DocSet could be initially unsupported 
and added later.  A trick would be to participate in the distributed-search 
protocol but exclude the DocSlice (e.g. by setting rows=0) since that portion 
of the results is handled separately with the sub-queries mechanism just 
described; we don't need/want QueryComponent to get the top docs on its own.  
When it needs the DocSet, the sharded query must be a disjunction (logical OR) 
of the sub-queries, e.g. by simply setting the main query to that disjunction.
 * I think this approach raises fewer concerns about interfacing with 
QueryComponent's existing code and about code duplication.
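
The fan-out and the disjunction mentioned in the details above could look roughly like this.  It is a sketch in plain Java only: it deliberately does not use Solr's ShardHandler or SearchComponent APIs, the distributedSearch function is a hypothetical stand-in for "run this sub-query across all shards and return its ranked doc ids", and the sub-queries are assumed to be plain lucene-syntax strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

/** Hypothetical sketch, not Solr code. */
final class SubQueryFanOutSketch {

  /** Submits every sub-query to the executor, then waits for all ranked lists. */
  static Map<String, List<String>> runAll(
      Map<String, String> subQueries,                   // name -> query string
      Function<String, List<String>> distributedSearch, // hypothetical stand-in
      ExecutorService executor) {
    Map<String, CompletableFuture<List<String>>> pending = new LinkedHashMap<>();
    subQueries.forEach((name, q) ->
        pending.put(name,
            CompletableFuture.supplyAsync(() -> distributedSearch.apply(q), executor)));

    Map<String, List<String>> results = new LinkedHashMap<>();
    pending.forEach((name, future) -> results.put(name, future.join()));
    return results;
  }

  /** Logical OR of the sub-queries, for the rows=0 request that only needs the DocSet. */
  static String disjunction(List<String> queries) {
    List<String> wrapped = new ArrayList<>();
    for (String q : queries) {
      wrapped.add("(" + q + ")");
    }
    return String.join(" OR ", wrapped);
  }
}
```

All futures are created before any is joined, so the sub-queries run concurrently up to the executor's capacity, and the result map keeps the sub-queries' original order.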

> Introduce support for Reciprocal Rank Fusion (combining queries)
> ----------------------------------------------------------------
>
>                 Key: SOLR-17319
>                 URL: https://issues.apache.org/jira/browse/SOLR-17319
>             Project: Solr
>          Issue Type: New Feature
>          Components: vector-search
>    Affects Versions: 9.6.1
>            Reporter: Alessandro Benedetti
>            Assignee: Alessandro Benedetti
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 23h 10m
>  Remaining Estimate: 0h
>
> Reciprocal Rank Fusion (RRF) is an algorithm that takes multiple ranked 
> lists as input and produces a unified result set. 
> Examples of use cases where RRF can be used include hybrid search and 
> multiple Knn vector queries executed concurrently. 
> RRF is based on the concept of reciprocal rank, which is the inverse of the 
> rank of a document in a ranked list of search results. 
> Search results are combined by taking into account the position of the 
> items in the original rankings, giving a higher score to items that are 
> ranked higher in multiple lists. RRF was first introduced by 
> Cormack et al. in [1].
> The syntax proposed:
> JSON Request
> {code:json}
> {
>     "queries": {
>         "lexical1": {
>             "lucene": {
>                 "query": "id:(10^=2 OR 2^=1 OR 4^=0.5)"
>             }
>         },
>         "lexical2": {
>             "lucene": {
>                 "query": "id:(2^=2 OR 4^=1 OR 3^=0.5)"
>             }
>         }
>     },
>     "limit": 10,
>     "fields": "[id,score]",
>     "params": {
>         "combiner": true,
>         "combiner.upTo": 5,
>         "facet": true,
>         "facet.field": "id",
>         "facet.mincount": 1
>     }
> }
> {code}
> [1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet 
> and individual rank learning methods.” Proceedings of the 32nd international 
> ACM SIGIR conference on Research and development in information retrieval 
> (2009)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
