[ https://issues.apache.org/jira/browse/SOLR-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018260#comment-18018260 ]
David Smiley commented on SOLR-17319:
-------------------------------------

Way 1: You mention that Chris referenced this, but I want to point out that he did so negatively, not in favor (AFAICT). Quoting him:
{quote}Isn't any approach that computes a "score" based on the RRF _per shard_ and _then_ merges the per-shard results to find the topN results by definition "wrong" according to the RRF formula? (or at least: the RRF formula as i understand it?)
{quote}
I completely agree with Chris, and I affirmed equivalent statements in my last summary. Only "Way 2" is correct. I'm not sure I understand/agree with what you said about "Way 2", so I'd like to offer what I think it'd look like:

Way 2: A new QueryComponent subclass or collaborating SearchComponent shall arrange to execute the sub-queries concurrently using distributed-search (thus across shards) to get, for each sub-query, a complete (whole-corpus) ranked list of the top offset+rows docs. It then shall merge and rank them according to RRF, and finally apply offset & rows to derive the correct DocSlice (page).

Details:
* I'm not sure whether the SearchComponent distributed-search protocol/API can process sub-queries in parallel. It can do shards in parallel, but I'm not sure about N sub-queries in parallel. It's a complicated, under-documented protocol as well. But certainly a component could use a ShardHandler's executor to do the requests independently, maybe using EmbeddedSolrServer to talk to the current core.
* I recommend forcing distributed-search / shortCircuit=false somehow so that you basically have one distributed implementation to code/maintain/test instead of two, thus not doing a separate single-shard optimized code path.
* Faceting or other queries requiring a DocSet could be initially unsupported and added later. A trick would be to participate in the distributed-search protocol but exclude the DocSlice (e.g.
by setting rows=0) since that portion of the results is handled separately by the sub-queries mechanism just described; we don't need/want QueryComponent to get the top docs on its own. The sharded query must be a disjunction of the sub-queries (logical OR) when it needs the DocSet, e.g. by simply setting the query to be that.
* I think there's less concern about interfacing with QueryComponent's existing code / code duplication.

> Introduce support for Reciprocal Rank Fusion (combining queries)
> ----------------------------------------------------------------
>
>                 Key: SOLR-17319
>                 URL: https://issues.apache.org/jira/browse/SOLR-17319
>             Project: Solr
>          Issue Type: New Feature
>          Components: vector-search
>    Affects Versions: 9.6.1
>            Reporter: Alessandro Benedetti
>            Assignee: Alessandro Benedetti
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 23h 10m
>  Remaining Estimate: 0h
>
> Reciprocal Rank Fusion (RRF) is an algorithm that takes multiple ranked
> lists as input and produces a unified result set.
> Examples of use cases where RRF can be used include hybrid search and
> multiple Knn vector queries executed concurrently.
> RRF is based on the concept of reciprocal rank, which is the inverse of the
> rank of a document in a ranked list of search results.
> The search results are combined by taking into account the position of
> the items in the original rankings, giving a higher score to items that
> are ranked higher in multiple lists. RRF was first introduced by
> Cormack et al. in [1].
> The syntax proposed:
> JSON Request
> {code:json}
> {
>     "queries": {
>         "lexical1": {
>             "lucene": {
>                 "query": "id:(10^=2 OR 2^=1 OR 4^=0.5)"
>             }
>         },
>         "lexical2": {
>             "lucene": {
>                 "query": "id:(2^=2 OR 4^=1 OR 3^=0.5)"
>             }
>         }
>     },
>     "limit": 10,
>     "fields": "[id,score]",
>     "params": {
>         "combiner": true,
>         "combiner.upTo": 5,
>         "facet": true,
>         "facet.field": "id",
>         "facet.mincount": 1
>     }
> }
> {code}
> [1] Cormack, Gordon V. et al.
> “Reciprocal rank fusion outperforms Condorcet and individual rank learning methods.”
> Proceedings of the 32nd international ACM SIGIR conference on Research and
> development in information retrieval (2009)
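
To make the "Way 1" vs. "Way 2" distinction from the comment concrete, here is a minimal Python sketch. It is not Solr code and not the proposed implementation: the doc ids, scores, shard assignment, and the names `rrf_merge`, `ranking`, `way1` and `way2` are all hypothetical, and k=60 is assumed as the usual RRF constant from Cormack et al. rather than taken from the patch. It computes RRF once over whole-corpus rankings (Way 2) and once per shard followed by a merge on the fused score (Way 1), using the same toy data for both.

```python
from collections import defaultdict

K = 60  # assumed RRF constant (Cormack et al.); not taken from the Solr patch

# Hypothetical toy corpus: which shard holds each doc, and each sub-query's scores.
SHARD_OF = {"a1": "A", "a2": "A", "b1": "B"}
SUBQUERY_SCORES = [
    {"a1": 10.0, "a2": 9.0, "b1": 1.0},  # sub-query 1
    {"a2": 10.0, "a1": 9.0, "b1": 1.0},  # sub-query 2
]


def rrf_merge(ranked_lists, k=K):
    """Fuse ranked lists of doc ids; returns [(doc_id, rrf_score)], best first."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Descending score; ties broken by doc id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def ranking(scores, shard=None):
    """Doc ids sorted by descending score, optionally restricted to one shard."""
    docs = [d for d in scores if shard is None or SHARD_OF[d] == shard]
    return sorted(docs, key=lambda d: -scores[d])


def way2(offset, rows):
    """Whole-corpus top (offset+rows) per sub-query, then RRF, then paging."""
    depth = offset + rows
    fused = rrf_merge([ranking(s)[:depth] for s in SUBQUERY_SCORES])
    return [doc for doc, _ in fused[offset:offset + rows]]


def way1(offset, rows):
    """RRF inside each shard, then merge the shards by fused score, then paging."""
    depth = offset + rows
    merged = []
    for shard in sorted(set(SHARD_OF.values())):
        per_shard = [ranking(s, shard)[:depth] for s in SUBQUERY_SCORES]
        merged.extend(rrf_merge(per_shard))
    merged.sort(key=lambda item: (-item[1], item[0]))
    return [doc for doc, _ in merged[offset:offset + rows]]


if __name__ == "__main__":
    print("Way 2:", way2(0, 2))
    print("Way 1:", way1(0, 2))
```

In this toy data, b1 is the lowest-scoring doc for both sub-queries, yet it is rank 1 on its own shard for both, so the per-shard computation gives it the best possible fused score and puts it ahead of a1 and a2. The whole-corpus computation ranks it last, which is the discrepancy Chris's quoted objection describes.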