[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014547#comment-13014547 ]
Robert Muir commented on LUCENE-2959:
-------------------------------------

{quote}
One thing that is not clear for me is why these limitations would not be a problem for BM25. As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs and document length from the whole document – which, according to what you said, is not available in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward to implement.
{quote}

A variant sounds really interesting? I think you know better than me here, I just looked at the original paper and thought to myself that implementing this "by the book" might not be feasible for a while.

{quote}
Robert, would you be so kind to have a look at my proposal? It can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1. It's basically the same as what I sent to the mailing list. I wrote that I want to implement BM25, BM25F and DFR ("the framework", I meant with one or two smoothing models), as well as to convert the original scoring to the new framework. In light of the thread here, I guess it would be better to modify these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?
{quote}

I think you can decide what you want to do? Obviously I would love to see all of it done :) But it's your choice, I could see you going a couple different ways:

* closer to your original proposal, you could still develop a flexible scoring API on top of Similarity. Hey, all I did was move stuff from Scorer to Similarity really, which does give flexibility, but it's probably not what an IR researcher would want (it's low-level and confusing). So you could make a "SimpleSimilarity" or "EasySimilarity" or something that presents a much simpler API (something closer to what Terrier/Indri present) on top of this, for easily implementing ranking functions?
I think this would be extremely valuable long-term: who cares if we have a low-level flexible scoring API that only speed demons like, but IR practitioners find confusing and hideous? Someone who is trying to experiment with an enhancement to relevance likely doesn't care if their TREC run takes 30 seconds instead of 20 seconds if the API is really easy and they aren't wasting time fighting with Lucene? If you go this route, you could implement BM25, DFR, etc. as you suggested, as examples of how to use this API, and there would be more of a focus on API quality and simplicity instead of performance.

* or alternatively, you could refine your proposal to implement a really "production strength" version of one of these scoring systems on top of the low-level API, one that would ideally have competitive performance/documentation/etc. with Lucene's default scoring today. If you decide to do this, then yes, I would definitely suggest picking only one, because I think it's a ton of work as I listed above, and I think there would be more focus on practical things (some probably being nuances of Lucene) and performance.

> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is
> tailored specifically to VSM, which makes the addition of new ranking functions a
> non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to implement a
> query architecture with pluggable ranking functions.
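
To make the "SimpleSimilarity"/"EasySimilarity" idea from the comment above a bit more concrete, here is a minimal sketch of what such a researcher-facing layer might look like, with BM25 as the example ranking function. Everything in it is an assumption made for illustration: the names EasySimilaritySketch, Stats, RankingFunction and Bm25 are hypothetical and are not part of Lucene, and the idf shown is just one common BM25 variant. It does not touch the real Similarity class at all; it only shows how small the surface area could be if the statistics were handed over ready-made.

```java
/** Hypothetical sketch only; none of these types exist in Lucene. */
public class EasySimilaritySketch {

    /** Statistics a simple API might pass in for one term in one document. */
    static final class Stats {
        final double tf;         // term frequency in the document
        final double docLen;     // document length in tokens
        final double avgDocLen;  // average document length in the collection
        final long docFreq;      // number of documents containing the term
        final long numDocs;      // total number of documents in the collection

        Stats(double tf, double docLen, double avgDocLen, long docFreq, long numDocs) {
            this.tf = tf;
            this.docLen = docLen;
            this.avgDocLen = avgDocLen;
            this.docFreq = docFreq;
            this.numDocs = numDocs;
        }
    }

    /** The single method someone experimenting with ranking would implement. */
    interface RankingFunction {
        double score(Stats s);
    }

    /** BM25 using the non-negative idf variant log(1 + (N - n + 0.5) / (n + 0.5)). */
    static final class Bm25 implements RankingFunction {
        private final double k1; // term frequency saturation
        private final double b;  // strength of length normalization, 0..1

        Bm25(double k1, double b) {
            this.k1 = k1;
            this.b = b;
        }

        @Override
        public double score(Stats s) {
            double idf = Math.log(1.0 + (s.numDocs - s.docFreq + 0.5) / (s.docFreq + 0.5));
            double lengthNorm = k1 * (1.0 - b + b * s.docLen / s.avgDocLen);
            return idf * s.tf * (k1 + 1.0) / (s.tf + lengthNorm);
        }
    }

    public static void main(String[] args) {
        RankingFunction bm25 = new Bm25(1.2, 0.75);
        // Same tf and document length; only the document frequency differs.
        Stats rareTerm = new Stats(3, 100, 120, 10, 100_000);
        Stats commonTerm = new Stats(3, 100, 120, 50_000, 100_000);
        System.out.println("rare term score:   " + bm25.score(rareTerm));
        System.out.println("common term score: " + bm25.score(commonTerm));
    }
}
```

A layer like this trades speed for readability, which matches the first option in the comment: the ranking formula sits in one short method, and how the statistics get collected stays hidden behind whatever adapts it to the low-level API.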