Re: Any recommended issues to work on for a newcomer?

Alessandro Benedetti Fri, 31 May 2024 09:50:49 -0700

Just for your information, my Reciprocal Rank Fusion contribution to Solr is
in decent shape now:
https://github.com/apache/solr/pull/2489
Everything is on the Solr side, but it may serve as inspiration if you
want to do similar work in Lucene.


Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Mon, 20 May 2024 at 08:16, Michael Wechner <michael.wech...@wyona.com>
wrote:

> Hi Hank
>
> Very cool, thank you, will try to do this asap!
>
> All the best
>
> Michael
>
>
> On 19.05.24 at 01:42, Chang Hank wrote:
>
> Hey Michael,
>
> I wrote the first version of my idea for implementing RRF in Lucene;
> here is the link to the code:
> https://gist.github.com/hack4chang/ee2b37eab80bd82e574ff4f94ed204e9
> Right now I have two questions: one is about the shardIndex to be
> returned, the other is about the TotalHits value. Please take a look at the
> code and kindly leave some comments below.
>
> Thanks,
> Hank
>
> On May 18, 2024, at 2:01 PM, Chang Hank <hackchang0...@gmail.com> wrote:
>
> Or maybe we can first create an issue and PR based on the issue number?
> WDYT?
>
> Best,
>
> Hank
>
> On May 18, 2024, at 11:29 AM, Chang Hank <hackchang0...@gmail.com> wrote:
>
> Hey Michael,
>
> Sorry, I was a bit busy this week, but I’ve looked into the resources you
> provided and also at the useful advice from Alessandro and Adrien.
>
> I have a brief understanding of how RRF works, but I’m not quite sure
> how we should implement it. Based on the advice from Alessandro and Adrien,
> it seems we need to consider that the search results are located on
> different shards. According to Alessandro, we should aggregate the ranked
> lists from all distributed nodes and then apply RRF.
> Are we going to implement this aggregation logic inside our RRF method?
>
> Also, could you please create a PR so we can discuss the details further?
>
> All the best,
>
> Hank
>
> On May 13, 2024, at 10:09 AM, Michael Wechner <michael.wech...@wyona.com>
> wrote:
>
> Great, sounds like we have a plan :-)
>
> Hank and I can get started trying to understand the internals better ...
>
> Thanks
>
> Michael
>
> On 13.05.24 at 18:21, Alessandro Benedetti wrote:
>
> Sure, we can make it work, but in a distributed environment you first have
> to run each query distributed (aggregating across all nodes) and then apply
> RRF on top of the aggregated ranked lists.
> Doing RRF per node first and then aggregating across shards won't return
> the same results, I suspect.
> When I go back to working on the task I'll be able to elaborate more!
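
To illustrate the point above, here is a small self-contained sketch (plain Java, no Lucene or Solr classes) that compares the two orderings on toy data. Everything in it is an assumption made for illustration: the `Hit` record, the method names, the scores, and the layout with documents A and B on one shard and D on another. It is not how Solr or Lucene implement this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DistributedRrfSketch {
  /** Hypothetical stand-in for a scored hit; id plays the role of (shardIndex, doc). */
  record Hit(String id, double score) {}

  static final Comparator<Hit> BY_SCORE_DESC = (a, b) -> Double.compare(b.score(), a.score());

  /** Concatenates hit lists and sorts them by score, best first. */
  static List<Hit> mergeByScore(List<List<Hit>> lists) {
    List<Hit> all = new ArrayList<>();
    lists.forEach(all::addAll);
    all.sort(BY_SCORE_DESC); // stable sort: ties keep their input order
    return all;
  }

  /** RRF over ranked lists (best first): score(d) = sum of 1 / (k + rank), rank from 1. */
  static List<Hit> rrf(int k, List<List<Hit>> rankedLists) {
    Map<String, Double> scores = new LinkedHashMap<>();
    for (List<Hit> ranked : rankedLists) {
      for (int i = 0; i < ranked.size(); i++) {
        scores.merge(ranked.get(i).id(), 1.0 / (k + i + 1), Double::sum);
      }
    }
    List<Hit> fused = new ArrayList<>();
    scores.forEach((id, s) -> fused.add(new Hit(id, s)));
    fused.sort(BY_SCORE_DESC);
    return fused;
  }

  /** Aggregate each query across shards first, then apply RRF across the queries. */
  static List<Hit> aggregateThenFuse(int k, List<List<Hit>> q1ByShard, List<List<Hit>> q2ByShard) {
    return rrf(k, List.of(mergeByScore(q1ByShard), mergeByScore(q2ByShard)));
  }

  /** Apply RRF on each shard first, then merge the shards by their local RRF scores. */
  static List<Hit> fusePerShardThenMerge(int k, List<List<Hit>> q1ByShard, List<List<Hit>> q2ByShard) {
    List<List<Hit>> perShard = new ArrayList<>();
    for (int s = 0; s < q1ByShard.size(); s++) {
      perShard.add(rrf(k, List.of(q1ByShard.get(s), q2ByShard.get(s))));
    }
    return mergeByScore(perShard);
  }

  public static void main(String[] args) {
    // Shard 0 holds A and B, shard 1 holds only D, which scores worst for both queries.
    List<List<Hit>> q1 = List.of(List.of(new Hit("A", 9), new Hit("B", 8)), List.of(new Hit("D", 1)));
    List<List<Hit>> q2 = List.of(List.of(new Hit("B", 0.9), new Hit("A", 0.8)), List.of(new Hit("D", 0.1)));
    System.out.println("aggregate, then RRF: " + aggregateThenFuse(60, q1, q2));
    System.out.println("RRF per shard, then merge: " + fusePerShardThenMerge(60, q1, q2));
  }
}
```

In this toy data D is the worst hit for both queries, so it ends up last when the ranked lists are aggregated first. When RRF runs per shard, D is rank 1 in both of its local lists and gets the highest local RRF score, so it jumps to the top after merging.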
>
> Cheers
>
> On Mon, 13 May 2024 at 14:12, Adrien Grand <jpou...@gmail.com> wrote:
>
>> > Maybe Adrien Grand and others might also have some feedback :-)
>>
>> I'd suggest the signature look something like `TopDocs TopDocs#rrf(int
>> topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`.
>> Internally, it should look at `ScoreDoc#shardIndex` and `ScoreDoc#doc` to
>> figure out which hits map to the same document.
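
A minimal sketch of the merge described above, for illustration only. It deliberately avoids Lucene classes so that it runs on its own: the `Hit` record is a hypothetical stand-in for `ScoreDoc`, identifying a document by its shard and doc id, and `Fused` is a made-up result type. The method name and the `topN` and `k` parameters follow the suggested signature; the list-based types and the tie-breaking rule are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RrfSketch {
  /** Hypothetical stand-in for ScoreDoc: a document is identified by (shardIndex, doc). */
  record Hit(int shardIndex, int doc) {}

  /** A fused hit together with its RRF score. */
  record Fused(Hit hit, double score) {}

  /**
   * Each inner list holds one query's hits, best first.
   * score(d) = sum over the lists containing d of 1 / (k + rank), with rank starting at 1.
   * Assumes a document appears at most once per list.
   */
  static List<Fused> rrf(int topN, int k, List<List<Hit>> rankedLists) {
    // Record equality on (shardIndex, doc) decides which hits are the same document.
    Map<Hit, Double> scores = new LinkedHashMap<>();
    for (List<Hit> ranked : rankedLists) {
      for (int i = 0; i < ranked.size(); i++) {
        scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
      }
    }
    List<Fused> fused = new ArrayList<>();
    scores.forEach((hit, score) -> fused.add(new Fused(hit, score)));
    // Stable sort by descending score: ties keep first-seen order.
    fused.sort((a, b) -> Double.compare(b.score(), a.score()));
    return fused.subList(0, Math.min(topN, fused.size()));
  }

  public static void main(String[] args) {
    List<Hit> keyword = List.of(new Hit(0, 1), new Hit(0, 2), new Hit(0, 3));
    List<Hit> vector = List.of(new Hit(0, 3), new Hit(0, 1), new Hit(0, 4));
    for (Fused f : rrf(3, 60, List.of(keyword, vector))) {
      System.out.println(f.hit() + " " + f.score());
    }
  }
}
```

Keying on the (shard, doc) pair matters because doc ids are only unique within one index, so the same doc number coming from two different shards must not be fused into one entry.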
>>
>> > Back in the day, I was reasoning on this and I didn't think Lucene was
>> the right place for an interleaving algorithm, given that Reciprocal Rank
>> Fusion is affected by distribution and it's not supposed to work per node.
>>
>> To me this is like `TopDocs#merge`. There are changes needed on the
>> application side to hook this call into the logic that combines hits that
>> come from multiple shards (multiple queries in the case of RRF), but Lucene
>> can still provide the merging logic.
>>
>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner <
>> michael.wech...@wyona.com> wrote:
>>
>>> Thanks for your feedback Alessandro!
>>>
>>> I am using Lucene independently of Solr, OpenSearch or Elasticsearch, but
>>> would like to combine different result sets using RRF, and therefore think
>>> that Lucene itself could actually be a good place for it.
>>>
>>> Looking forward to your additional elaboration!
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>>
>>>
>>> On 13.05.2024 at 12:34, Alessandro Benedetti <a.benede...@sease.io>
>>> wrote:
>>>
>>> This is not strictly related to Lucene, but I'll give a talk at Berlin
>>> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache Solr.
>>> I'll resume my work on the contribution next week and have more to share
>>> later.
>>>
>>> Back in the day, I was reasoning on this and I didn't think Lucene was
>>> the right place for an interleaving algorithm, given that Reciprocal Rank
>>> Fusion is affected by distribution and it's not supposed to work per node.
>>> I think I evaluated the possibility of doing it as a Lucene query or a
>>> Lucene component but then ended up with a different approach.
>>> I'll elaborate more when I go back to the task!
>>>
>>> Cheers
>>>
>>> On Sat, 11 May 2024 at 09:10, Michael Wechner <michael.wech...@wyona.com>
>>> wrote:
>>>
>>>> sure, no problem!
>>>>
>>>> Maybe Adrien Grand and others might also have some feedback :-)
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>> On 10.05.24 at 23:03, Chang Hank wrote:
>>>>
>>>> Thank you for these useful resources, please allow me to spend some
>>>> time looking into them.
>>>> I’ll let you know asap!!
>>>>
>>>> Thanks
>>>>
>>>> Hank
>>>>
>>>> On May 10, 2024, at 12:34 PM, Michael Wechner
>>>> <michael.wech...@wyona.com> wrote:
>>>>
>>>> Also, we might want to consider how this relates to
>>>>
>>>>
>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html
>>>>
>>>> In vector search, reranking has become quite popular, e.g.
>>>>
>>>> https://docs.cohere.com/docs/reranking
>>>>
>>>> IIUC LangChain (python) for example adds the reranker as an argument to
>>>> the searcher/retriever
>>>>
>>>>
>>>> https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/
>>>>
>>>> So maybe the following might make sense as well
>>>>
>>>> // Pseudocode: CohereReranker, RRFRanker and these overloads are
>>>> // proposed names, not existing Lucene APIs.
>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50, new
>>>> CohereReranker());
>>>>
>>>> TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword,
>>>> topDocsVector);
>>>>
>>>> WDYT?
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>>
>>>> On 10.05.24 at 21:08, Michael Wechner wrote:
>>>>
>>>> great, yes, let's get started :-)
>>>>
>>>> What about the following pseudocode, assuming that there might be
>>>> alternative ranking algorithms to RRF?
>>>>
>>>> // Pseudocode: Ranker, RRFRanker and TopDocs.rank are proposed names,
>>>> // not existing Lucene APIs. indexReaderVector is assumed to be the
>>>> // reader of the vector index.
>>>> StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
>>>> StoredFields storedFieldsVector = indexReaderVector.storedFields();
>>>>
>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);
>>>>
>>>> Ranker ranker = new RRFRanker();
>>>> TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);
>>>>
>>>> for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>>>>     Document docK = storedFieldsKeyword.document(scoreDoc.doc);
>>>>     Document docV = storedFieldsVector.document(scoreDoc.doc);
>>>>     ....
>>>> }
>>>>
>>>> See also
>>>>
>>>>
>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html
>>>>
>>>> WDYT?
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>>
>>>> On 10.05.24 at 20:01, Chang Hank wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Sounds good to me.
>>>> Let’s do it!!
>>>>
>>>> Cheers,
>>>> Hank
>>>>
>>>> On May 10, 2024, at 10:50 AM, Michael Wechner
>>>> <michael.wech...@wyona.com> wrote:
>>>>
>>>> Hi Hank
>>>>
>>>> Very cool!
>>>>
>>>> Adrien Grand suggested implementing it as a utility method on the
>>>> TopDocs class, and since Adrien has worked on Lucene for a decade
>>>> (https://www.elastic.co/de/blog/author/adrien-grand), I guess it makes
>>>> sense to follow his advice :-)
>>>>
>>>> We could create a PR and work together on it, WDYT?
>>>>
>>>> All the best
>>>>
>>>> Michael
>>>>
>>>> On 10.05.24 at 18:51, Chang Hank wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Thank you for the reply.
>>>> This is a really cool issue to work on, and I’m happy to work on it
>>>> with you. I’ll try to do some research on RRF first.
>>>> Also, are we going to implement this on the TopDocs class?
>>>>
>>>> Best,
>>>> Hank
>>>>
>>>>
>>>> On May 9, 2024, at 11:08 PM, Michael Wechner
>>>> <michael.wech...@wyona.com> wrote:
>>>>
>>>> Hi Hank
>>>>
>>>> Thanks for offering your help!
>>>>
>>>> I recently suggested implementing RRF (Reciprocal Rank Fusion)
>>>>
>>>> https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz
>>>>
>>>> but still have not found the time to really work on this.
>>>>
>>>> Maybe you would be interested in doing this, or we could work on it
>>>> together somehow?
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>> On 10.05.24 at 07:27, Chang Hank wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I’m Hank Chang, currently studying Information Retrieval topics. I’m
>>>> really interested in contributing to Apache Lucene and enhancing my
>>>> understanding of the field.
>>>> I’ve reviewed several issues posted on the GitHub repository but
>>>> haven’t found a straightforward starting point. Could someone please
>>>> recommend suitable issues for a newcomer like me or suggest areas I could
>>>> assist with?
>>>>
>>>> Thank you for your time and guidance.
>>>>
>>>> Best regards,
>>>> Hank Chang
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>
>>>
>>
>> --
>> Adrien
>>
>
>
>
>
>
>
