[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012996#comment-13012996 ]
Robert Muir commented on LUCENE-2959:
-------------------------------------

Hi David, to try to help get things moving, I created a branch: https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring

This is the in-progress work from LUCENE-2392, which separates the scoring calculations from the postings-list matching. In short, Similarity becomes very low-level, but the idea is that you extend it to present a higher-level API (for example TFIDFSimilarity: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java) that is user-friendly and lets users adjust parameters in a way that makes sense for that scoring system.

As a start I implemented some very rough, basic models in src/test:
BM25: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockBM25Similarity.java
Dirichlet LM: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockLMSimilarity.java

But these are in no way correct, extensible, or nice. For example, the BM25 similarity is slow because, as implemented, its "average document length" is "live" (e.g. if you add more segments it's immediately adjusted for each query)... there is no caching at all. To speed up BM25 in this case, it could be nice for the Similarity to pull this statistic up front and create cached calculations; if a user wants to refresh their BM25 stats, they could call something on SimilarityProvider to recalculate the caches. However, for a user who wants a "super-realtime" view, it might be better to stay the way it is now, or alternatively for the Sim to do the 256 calculations up front per query (ideally in the weight, not per-segment in the doc scorer) to tableize the length normalizations (a rough sketch of this idea follows the numbered list below). These are the API challenges we need to consider if we want to provide actual implementations of these scoring systems: how to make them perform close to, or as fast as, Lucene's current scoring model.

Separately on this issue, I want to make Weight completely opaque to the sim; really it's just a way for a Similarity to compute things up front (such as IDF, but maybe things like these BM25 length-norm caches too). Currently it can only hold a single float value (see my un-sqrt'ing and other hacks in the Mock sims), so this should be fixed. Another big TODO: just as Scorer was split (maybe we should rename it to Matcher now that the sim does the calculations?), Explanations need to be split too, so that a Sim is completely responsible for explaining itself. Another TODO I have is to write the norm summation into the norms file as a single vlong, rather than computing it across all the byte[] norms in SegmentReader like I do now... I just implemented it that way so that we could play with scoring algorithms easily.

So, the good news is that scoring becomes a lot more flexible, but the "bad" news is that in order to support Lucene's features, implementing a new ranking system on top of Similarity is really *serious* work, as you need to:
# implement the lower-level API efficiently, yet expose a nice high-level API such as TFIDFSimilarity's tf() and idf() hooks for users.
# implement explanations so that users can debug relevance issues.
# think about letting users balance the various performance tradeoffs, such as the performance gained by caching things versus using realtime statistics (some of this could be in my head; maybe computing the 256 norm decoder caches up front is really cheap and a non-issue).
# consider how to integrate Lucene's features into the ranking system: for example, how to estimate a reasonable "phrase IDF" for phrase/multiphrase/span queries; how to integrate index-time boosts (in my example BM25 etc. I just made the documents appear shorter to accomplish this); how to pick the best quantization for the length normalization, depending upon how it is stored in the index (it might not be SmallFloat352); etc.
# do all the relevance testing to ensure that things are correct (I found lots of bugs doing rough testing on my Mock ones, and there are probably more, but on the couple of test collections I tried they seemed reasonable).
# add good-quality documentation, such as what we have today in TFIDFSimilarity, that explains how the ranking system works and how you can tune it.
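To make the "tableize the length normalizations" idea above a bit more concrete, here is a very rough sketch. This is *not* the branch's actual API: the class name, the decodeNorm() helper, and the k1/b parameters are just placeholders, and index-time boosts are ignored. The point is only that the weight builds a 256-entry table once per query, so the per-document scorer does a single array lookup:

{code:java}
// Rough sketch only, not the flexscoring API: precompute the BM25 length
// normalization for every possible norm byte, once per query.
import org.apache.lucene.util.SmallFloat;

class Bm25LengthNormCache {
  private final float[] cache = new float[256];

  /** Built up front (e.g. in the weight), not per-segment in the doc scorer. */
  Bm25LengthNormCache(float k1, float b, float avgDocLength) {
    for (int i = 0; i < 256; i++) {
      float docLength = decodeNorm((byte) i); // approximate doc length for this norm byte
      cache[i] = k1 * ((1 - b) + b * (docLength / avgDocLength));
    }
  }

  /** Hot path: one lookup per document instead of a decode and a division. */
  float k(byte normByte) {
    return cache[normByte & 0xFF];
  }

  /** Placeholder decoder: assumes the index stores 1/sqrt(length) encoded with
      SmallFloat (as DefaultSimilarity does today) and inverts it; which
      quantization a real implementation should use is an open question
      (see the fourth item in the list above). */
  private static float decodeNorm(byte b) {
    float f = SmallFloat.byte315ToFloat(b);
    return f == 0f ? 0f : 1f / (f * f);
  }
}
{code}

The doc scorer would then compute something like idf * tf * (k1 + 1) / (tf + k(normByte)) per document. Whether this table is worth caching across queries, or whether redoing the 256 multiplications against live statistics on every query is cheap enough, is exactly the kind of tradeoff in the third item of the list above.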
> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
> Key: LUCENE-2959
> URL: https://issues.apache.org/jira/browse/LUCENE-2959
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Examples, Javadocs, Query/Scoring
> Reporter: David Mark Nemeskey
> Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: implementation_plan.pdf, proposal.pdf
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state-of-the-art algorithms, such as BM25. Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task.
> This project aims to bring state-of-the-art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions.