[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014472#comment-13014472 ]
David Mark Nemeskey commented on LUCENE-2959:
---------------------------------------------

Robert,

As for the problems with BM25F:

{quote}
* for any field, Lucene has a per-field terms dictionary that contains that term's docFreq. To compute BM25f's IDF method would be challenging, because it wants a docFreq "across all the fields".
* the same issue applies to length normalization, lucene has a "field length" but really no concept of document length.
{quote}

One thing that is not clear to me is why these limitations would not be a problem for BM25 as well. As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs and document length from the whole document, which, according to what you said, is not available in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward to implement.

{quote}
(its not clear to me at a glance either from the original paper, if this should be across only the fields in the query, across all the fields in the document, and if a "static" schema is implied in this scoring system (in lucene document 1 can have 3 fields and document 2 can have 40 different ones, even with different properties).
{quote}

Actually I am not sure there is a consensus on what BM25F actually is. :) For example, the BM25 formula can be applied to the weighted sum of field tfs, or alternatively, the per-field BM25 scores can be summed after normalization. I've seen both called (maybe incorrectly) BM25F.

If I understand correctly, the current scoring algorithm takes into account only the fields explicitly specified in the query. Is that right? If so, I see no reason why BM25 should behave otherwise. Which of course also means that we probably won't be able to store the doc length and idf summed across fields.

Robert, would you be so kind as to have a look at my proposal? It can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1.
It's basically the same as what I sent to the mailing list. I wrote that I want to implement BM25, BM25F and DFR ("the framework", by which I meant with one or two smoothing models), as well as to convert the original scoring to the new framework. In light of the thread here, I guess it would be better to modify these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?

As for the last item, it applies only if I continue / join the work in 2392. Since I guess nobody wants two ranking frameworks, of course I will, but then in this part of the proposal should I just concentrate on the higher level API?

Thanks!

> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the
> architecture is tailored specifically to VSM, which makes the addition of new
> ranking functions a non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to
> implement a query architecture with pluggable ranking functions.
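
For reference, the two readings of BM25F mentioned in the comment (BM25 applied to a weighted sum of field tfs, versus a weighted sum of per-field BM25 scores) can be sketched for a single query term as below. This is only a rough illustration under stated assumptions, not Lucene code: `FieldStats`, both function names, the field `weight`, and the defaults `k1=1.2`, `b=0.75` are hypothetical, and `idf` is passed in by the caller, which deliberately sidesteps the open question of which docFreq it should be computed from.

```python
from dataclasses import dataclass


@dataclass
class FieldStats:
    """Per-field statistics for one query term in one document (hypothetical type)."""
    tf: float            # term frequency of the query term in this field
    length: float        # number of tokens in this field of the document
    avg_length: float    # average length of this field over the collection
    weight: float = 1.0  # field boost (assumed parameter)


def length_norm(length, avg_length, b):
    """BM25 length normalisation factor: 1 - b + b * (len / avg_len)."""
    return 1.0 - b + b * (length / avg_length)


def bm25f_combined_tf(fields, idf, k1=1.2, b=0.75):
    """Reading 1: sum weighted, length-normalised field tfs, then saturate once."""
    pseudo_tf = sum(
        f.weight * f.tf / length_norm(f.length, f.avg_length, b) for f in fields
    )
    return idf * pseudo_tf * (k1 + 1.0) / (pseudo_tf + k1)


def bm25_per_field_sum(fields, idf, k1=1.2, b=0.75):
    """Reading 2: score each field with plain BM25, then sum the weighted scores."""
    total = 0.0
    for f in fields:
        norm = length_norm(f.length, f.avg_length, b)
        total += f.weight * idf * f.tf * (k1 + 1.0) / (f.tf + k1 * norm)
    return total


if __name__ == "__main__":
    # A term occurring once in a short "title" field and three times in a "body" field.
    doc = [FieldStats(tf=1.0, length=5.0, avg_length=5.0),
           FieldStats(tf=3.0, length=100.0, avg_length=100.0)]
    print(bm25f_combined_tf(doc, idf=1.0), bm25_per_field_sum(doc, idf=1.0))
```

With a single field of weight 1 the two functions agree. With several fields and unit weights, the first reading saturates the term frequency only once, so a term that occurs in more than one field scores lower than under the per-field sum.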