For the curious, my Reciprocal Rank Fusion contribution to Solr is in decent shape now:
https://github.com/apache/solr/pull/2489

Everything is on the Solr side, but it may serve as inspiration if you want to do similar work in Lucene.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>

On Mon, 20 May 2024 at 08:16, Michael Wechner <michael.wech...@wyona.com> wrote:

> Hi Hank
>
> Very cool, thank you, will try to do this asap!
>
> All the best
>
> Michael
>
> On 19.05.24 at 01:42, Chang Hank wrote:
>
> Hey Michael,
>
> I wrote the first version of my idea about implementing RRF in Lucene; here is the link to the code:
> https://gist.github.com/hack4chang/ee2b37eab80bd82e574ff4f94ed204e9
> Right now I have some questions: one is about the shardIndex to be returned, another is the TotalHits value. Please take a look at the code and kindly leave some comments below.
>
> Thanks,
> Hank
>
> On May 18, 2024, at 2:01 PM, Chang Hank <hackchang0...@gmail.com> wrote:
>
> Or maybe we can first create an issue and PR based on the issue number? WDYT?
>
> Best,
> Hank
>
> On May 18, 2024, at 11:29 AM, Chang Hank <hackchang0...@gmail.com> wrote:
>
> Hey Michael,
>
> Sorry I was a bit busy this week, but I’ve looked into the resources you provided and also some useful advice from Alessandro and Adrien.
>
> I have a brief understanding of how RRF works, but I’m not quite sure how we should implement it. Based on the advice from Alessandro and Adrien, it seems we need to consider that the search results are located on different shards. According to Alessandro, we should aggregate the ranked lists from all distributed nodes and then apply RRF.
> Are we going to implement this aggregation logic inside our RRF method?
>
> Also, could you please create a PR so we can discuss the details further?
>
> All the best,
> Hank
>
> On May 13, 2024, at 10:09 AM, Michael Wechner <michael.wech...@wyona.com> wrote:
>
> Great, sounds like we have a plan :-)
>
> Hank and I can get started trying to understand the internals better ...
>
> Thanks
>
> Michael
>
> On 13.05.24 at 18:21, Alessandro Benedetti wrote:
>
> Sure, we can make it work, but in a distributed environment you first have to run each query distributed (aggregating across all nodes) and then apply RRF on top of the aggregated ranked lists.
> Doing RRF per node first and then aggregating per shard won't return the same results, I suspect.
> When I go back to working on the task I'll be able to elaborate more!
>
> Cheers
>
> On Mon, 13 May 2024 at 14:12, Adrien Grand <jpou...@gmail.com> wrote:
>
>> > Maybe Adrien Grand and others might also have some feedback :-)
>>
>> I'd suggest the signature to look something like `TopDocs TopDocs#rrf(int topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`. Internally, it should look at `ScoreDoc#shardIndex` and `ScoreDoc#doc` to figure out which hits map to the same document.
>>
>> > Back in the day, I was reasoning on this and I didn't think Lucene was the right place for an interleaving algorithm, given that Reciprocal Rank Fusion is affected by distribution and it's not supposed to work per node.
>>
>> To me this is like `TopDocs#merge`.
>> There are changes needed on the application side to hook this call into the logic that combines hits that come from multiple shards (multiple queries in the case of RRF), but Lucene can still provide the merging logic.
>>
>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>>
>>> Thanks for your feedback Alessandro!
>>>
>>> I am using Lucene independently of Solr, OpenSearch, or Elasticsearch, but would like to combine different result sets using RRF, so I think Lucene itself could actually be a good place for it.
>>>
>>> Looking forward to your additional elaboration!
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> On 13.05.2024 at 12:34, Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>
>>> This is not strictly related to Lucene, but I'll give a talk at Berlin Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache Solr.
>>> I'll resume my work on the contribution next week and have more to share later.
>>>
>>> Back in the day, I was reasoning on this and I didn't think Lucene was the right place for an interleaving algorithm, given that Reciprocal Rank Fusion is affected by distribution and it's not supposed to work per node.
>>> I think I evaluated the possibility of doing it as a Lucene query or a Lucene component but then ended up with a different approach.
>>> I'll elaborate more when I go back to the task!
>>>
>>> Cheers
>>>
>>> On Sat, 11 May 2024 at 09:10, Michael Wechner <michael.wech...@wyona.com> wrote:
>>>
>>>> sure, no problem!
>>>>
>>>> Maybe Adrien Grand and others might also have some feedback :-)
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>> On 10.05.24 at 23:03, Chang Hank wrote:
>>>>
>>>> Thank you for these useful resources, please allow me to spend some time looking into them.
>>>> I’ll let you know asap!!
>>>>
>>>> Thanks
>>>>
>>>> Hank
>>>>
>>>> On May 10, 2024, at 12:34 PM, Michael Wechner <michael.wech...@wyona.com> wrote:
>>>>
>>>> also we might want to consider how this relates to
>>>>
>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html
>>>>
>>>> In vector search, reranking has become quite popular, e.g.
>>>>
>>>> https://docs.cohere.com/docs/reranking
>>>>
>>>> IIUC LangChain (python) for example adds the reranker as an argument to the searcher/retriever
>>>>
>>>> https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/
>>>>
>>>> So maybe the following might make sense as well
>>>>
>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>> TopDocs topDocsVector = vectorSearcher.search(query, 50, new CohereReranker());
>>>>
>>>> TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);
>>>>
>>>> WDYT?
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>> On 10.05.24 at 21:08, Michael Wechner wrote:
>>>>
>>>> great, yes, let's get started :-)
>>>>
>>>> What about the following pseudo code, assuming that there might be alternative ranking algorithms to RRF
>>>>
>>>> StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
>>>> StoredFields storedFieldsVector = indexReaderVector.storedFields();
>>>>
>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);
>>>>
>>>> // Ranker, RRFRanker and TopDocs.rank are the proposed API
>>>> Ranker ranker = new RRFRanker();
>>>> TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);
>>>>
>>>> for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>>>>     Document docK = storedFieldsKeyword.document(scoreDoc.doc);
>>>>     Document docV = storedFieldsVector.document(scoreDoc.doc);
>>>>     ....
>>>> }
>>>>
>>>> See also
>>>>
>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html
>>>>
>>>> WDYT?
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>> On 10.05.24 at 20:01, Chang Hank wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Sounds good to me.
>>>> Let’s do it!!
>>>>
>>>> Cheers,
>>>> Hank
>>>>
>>>> On May 10, 2024, at 10:50 AM, Michael Wechner <michael.wech...@wyona.com> wrote:
>>>>
>>>> Hi Hank
>>>>
>>>> Very cool!
>>>>
>>>> Adrien Grand suggested implementing it as a utility method on the TopDocs class, and since Adrien worked for a decade on Lucene
>>>> https://www.elastic.co/de/blog/author/adrien-grand
>>>> I guess it makes sense to follow his advice :-)
>>>>
>>>> We could create a PR and work together on it, WDYT?
>>>>
>>>> All the best
>>>>
>>>> Michael
>>>>
>>>> On 10.05.24 at 18:51, Chang Hank wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Thank you for the reply.
>>>> This is a really cool issue to work on, and I’m happy to work on it with you. I’ll try to do some research on RRF first.
>>>> Also, are we going to implement this on the TopDocs class?
>>>>
>>>> Best,
>>>> Hank
>>>>
>>>> On May 9, 2024, at 11:08 PM, Michael Wechner <michael.wech...@wyona.com> wrote:
>>>>
>>>> Hi Hank
>>>>
>>>> Thanks for offering your help!
>>>>
>>>> I recently suggested implementing RRF (Reciprocal Rank Fusion)
>>>>
>>>> https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz
>>>>
>>>> but still have not found the time to really work on this.
>>>>
>>>> Maybe you would be interested in doing this, or we could work on it together somehow?
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>> On 10.05.24 at 07:27, Chang Hank wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I’m Hank Chang, currently studying Information Retrieval topics. I’m really interested in contributing to Apache Lucene and enhancing my understanding of the field.
>>>> I’ve reviewed several issues posted on the GitHub repository but haven’t found a straightforward starting point. Could someone please recommend suitable issues for a newcomer like me or suggest areas I could assist with?
>>>>
>>>> Thank you for your time and guidance.
>>>>
>>>> Best regards,
>>>> Hank Chang
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>> --
>> Adrien
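
For readers following the thread, the fusion step being discussed can be sketched roughly as follows. This is only an illustration, not the code from the gist or the Solr PR: it follows the shape of the signature Adrien suggested (topN, k, several ranked lists) and identifies a document by the pair (shardIndex, doc), but `RrfSketch`, `Hit` and `Fused` are hypothetical stand-ins so the example runs without Lucene on the classpath. Each document gets the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1. It assumes each input list is already sorted best-first and contains a document at most once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; Hit and Fused are hypothetical stand-ins, not Lucene classes. */
final class RrfSketch {

  /** Stand-in for the two ScoreDoc fields that identify a document. */
  record Hit(int shardIndex, int doc) {}

  /** A fused result: the document key plus its accumulated RRF score. */
  record Fused(Hit hit, double score) {}

  /**
   * Fuses several ranked lists with Reciprocal Rank Fusion.
   * Assumes every list is sorted best-first and holds each document at most once.
   *
   * @param topN maximum number of fused hits to return
   * @param k    smoothing constant added to the rank
   */
  static List<Fused> rrf(int topN, int k, List<List<Hit>> rankedLists) {
    if (topN < 1 || k < 1) {
      throw new IllegalArgumentException("topN and k must be >= 1");
    }
    // Accumulate 1 / (k + rank) per document across all lists.
    Map<Hit, Double> scores = new LinkedHashMap<>();
    for (List<Hit> ranked : rankedLists) {
      for (int i = 0; i < ranked.size(); i++) {
        int rank = i + 1; // ranks are 1-based
        scores.merge(ranked.get(i), 1.0 / (k + rank), Double::sum);
      }
    }
    List<Fused> fused = new ArrayList<>();
    scores.forEach((hit, score) -> fused.add(new Fused(hit, score)));
    // Highest score first; ties broken by shard index, then doc id, for determinism.
    fused.sort((a, b) -> {
      int byScore = Double.compare(b.score(), a.score());
      if (byScore != 0) {
        return byScore;
      }
      int byShard = Integer.compare(a.hit().shardIndex(), b.hit().shardIndex());
      return byShard != 0 ? byShard : Integer.compare(a.hit().doc(), b.hit().doc());
    });
    return new ArrayList<>(fused.subList(0, Math.min(topN, fused.size())));
  }
}
```

As a small worked case: fusing a keyword list with docs [1, 2, 3] and a vector list with docs [3, 1, 4] (all on shard 0) at k = 60 gives doc 1 a score of 1/61 + 1/62 and doc 3 a score of 1/63 + 1/61, so doc 1 ranks first, then doc 3, then doc 2, then doc 4. In line with Alessandro's point about distribution, the lists passed in would be the per-query lists already aggregated across all shards, not per-node results.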