[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012996#comment-13012996 ]
Robert Muir commented on LUCENE-2959:
-------------------------------------

Hi David, to try to help get things moving, I created a branch: https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring

This is the in-progress work from LUCENE-2392, which separates the scoring calculations from the postings-list matching. In short, Similarity becomes very low-level, but the idea is that you extend it to present a higher-level API (for example TFIDFSimilarity: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java) that is user-friendly and lets users adjust parameters in a way that makes sense for that scoring system.

As a start I implemented some very rough, basic models in src/test:
BM25: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockBM25Similarity.java
Dirichlet LM: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockLMSimilarity.java

But these are in no way correct, extensible, or nice. For example, the BM25 similarity is slow because, as implemented, its "average document length" is "live" (e.g. if you add more segments it's immediately adjusted for each query)... there is no caching at all. To speed up BM25 in this case, it could be nice for the Similarity to pull this statistic up front and create cached calculations; if a user wants to refresh their BM25 stats, they could call something on SimilarityProvider to recalculate the caches. However, for a user who wants a "super-realtime" view, it might be better to stay the way it is now, or alternatively for the Sim to do the 256 calculations up front per query (ideally in the weight, not per-segment in the doc scorer) to tableize the length normalizations (a rough sketch of this idea follows the numbered list below). These are the API challenges we need to consider if we want to provide actual implementations of these scoring systems: how to make them perform close to, or as fast as, Lucene's current scoring model.

Separately on this issue, I want to make Weight completely opaque to the sim; really it's just a way for a Similarity to compute things up front (such as IDF, but maybe things like these BM25 length-norm caches too). Currently it can only hold a single float value (see my un-sqrt'ing and other hacks in the Mock sims), so this should be fixed. Another big TODO: just as Scorer was split (maybe we should rename it to Matcher now that the sim does the calculations?), Explanations need to be split too, so that a Sim is completely responsible for explaining itself. Another TODO I have is to write the norm summation into the norms file as a single vlong, rather than computing it across all the byte[] norms in SegmentReader like I do now... I just implemented it that way so that we could play with scoring algorithms easily.

So, the good news is that scoring becomes a lot more flexible, but the "bad" news is that in order to support Lucene's features, implementing a new ranking system on top of Similarity is really *serious* work, as you need to:
# implement the lower-level API efficiently, yet expose a nice high-level API such as TFIDFSimilarity's tf() and idf() hooks for users.
# implement explanations so that users can debug relevance issues.
# think about letting users balance the various performance tradeoffs, such as the performance gained by caching things versus using realtime statistics (some of this could be in my head; maybe computing the 256 norm decoder caches up front is really cheap and a non-issue).
# consider how to integrate Lucene's features into the ranking system: for example, how to estimate a reasonable "phrase IDF" for phrase/multiphrase/span queries; how to integrate index-time boosts (in my example BM25 etc. I just made the documents appear shorter to accomplish this); how to pick the best quantization for the length normalization, depending upon how it is stored in the index (it might not be SmallFloat352); etc.
# do all the relevance testing to ensure that things are correct (I found lots of bugs doing rough testing on my Mock ones, and there are probably more, but on the couple of test collections I tried they seemed reasonable).
# add good-quality documentation, such as what we have today in TFIDFSimilarity, that explains how the ranking system works and how you can tune it.
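To make the "tableize the length normalizations" idea above a bit more concrete, here is a very rough sketch. This is *not* the branch's actual API: the class name, the decodeNorm() helper, and the k1/b parameters are just placeholders, and index-time boosts are ignored. The point is only that the weight builds a 256-entry table once per query, so the per-document scorer does a single array lookup:

{code:java}
// Rough sketch only, not the flexscoring API: precompute the BM25 length
// normalization for every possible norm byte, once per query.
import org.apache.lucene.util.SmallFloat;

class Bm25LengthNormCache {
  private final float[] cache = new float[256];

  /** Built up front (e.g. in the weight), not per-segment in the doc scorer. */
  Bm25LengthNormCache(float k1, float b, float avgDocLength) {
    for (int i = 0; i < 256; i++) {
      float docLength = decodeNorm((byte) i); // approximate doc length for this norm byte
      cache[i] = k1 * ((1 - b) + b * (docLength / avgDocLength));
    }
  }

  /** Hot path: one lookup per document instead of a decode and a division. */
  float k(byte normByte) {
    return cache[normByte & 0xFF];
  }

  /** Placeholder decoder: assumes the index stores 1/sqrt(length) encoded with
      SmallFloat (as DefaultSimilarity does today) and inverts it; which
      quantization a real implementation should use is an open question
      (see the fourth item in the list above). */
  private static float decodeNorm(byte b) {
    float f = SmallFloat.byte315ToFloat(b);
    return f == 0f ? 0f : 1f / (f * f);
  }
}
{code}

The doc scorer would then compute something like idf * tf * (k1 + 1) / (tf + k(normByte)) per document. Whether this table is worth caching across queries, or whether redoing the 256 multiplications against live statistics on every query is cheap enough, is exactly the kind of tradeoff in the third item of the list above.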
> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
> Key: LUCENE-2959
> URL: https://issues.apache.org/jira/browse/LUCENE-2959
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Examples, Javadocs, Query/Scoring
> Reporter: David Mark Nemeskey
> Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: implementation_plan.pdf, proposal.pdf
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state-of-the-art algorithms, such as BM25. Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task.
> This project aims to bring state-of-the-art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions.