[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

David Mark Nemeskey (JIRA) Fri, 01 Apr 2011 01:47:51 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014472#comment-13014472
 ]


David Mark Nemeskey commented on LUCENE-2959:
---------------------------------------------

Robert,

As for the problems with BM25F:

{quote}
    * for any field, Lucene has a per-field terms dictionary that contains that 
term's docFreq. To compute BM25f's IDF method would be challenging, because it 
wants a docFreq "across all the fields".
    * the same issue applies to length normalization, lucene has a "field 
length" but really no concept of document length.
{quote}

One thing that is not clear to me is why these limitations would not also be a 
problem for BM25. As I see it, the difference between the two methods is that 
BM25 simply computes tfs, idfs and document length from the whole document, 
which, according to what you said, is not available in Lucene. That's why I 
figured that a variant of BM25F would actually be more straightforward to 
implement.

{quote}
(its not clear to me at a glance either from the original paper, if this should 
be across only the fields in the query, across all the fields in the document, 
and if a "static" schema is implied in this scoring system (in lucene document 
1 can have 3 fields and document 2 can have 40 different ones, even with 
different properties).
{quote}

Actually, I am not sure there is a consensus on what BM25F actually is. :) For 
example, the BM25 formula can be applied to the weighted sum of the field tfs, 
or alternatively, the per-field BM25 scores can be summed after normalization. 
I've seen both called (maybe incorrectly) BM25F.
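
To make the two readings concrete, here is a minimal Java sketch. It is not 
Lucene code and not taken from any paper verbatim: the class and method names, 
the FieldStats holder, the k1 = 1.2 and b = 0.75 defaults, the single shared 
idf, and the choice to length-normalize each field before combining are all 
assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of two scoring schemes that both get called "BM25F"; not Lucene code. */
final class Bm25fVariants {
    static final double K1 = 1.2;  // assumed tf saturation parameter
    static final double B = 0.75;  // assumed length normalization parameter

    /** Hypothetical per-field statistics for one query term in one document. */
    static final class FieldStats {
        final double tf, length, avgLength, weight;

        FieldStats(double tf, double length, double avgLength, double weight) {
            this.tf = tf;
            this.length = length;
            this.avgLength = avgLength;
            this.weight = weight;
        }

        /** Term frequency after BM25-style field length normalization. */
        double normalizedTf() {
            return tf / (1 - B + B * length / avgLength);
        }
    }

    /** BM25 saturation applied to an already length-normalized tf. */
    static double saturate(double normTf) {
        return normTf * (K1 + 1) / (normTf + K1);
    }

    /** Reading 1: weight and sum the field tfs, then apply BM25 once. */
    static double combinedTfScore(List<FieldStats> fields, double idf) {
        double tf = 0;
        for (FieldStats f : fields) {
            tf += f.weight * f.normalizedTf();
        }
        return idf * saturate(tf);
    }

    /** Reading 2: score each field with BM25, then sum the weighted scores. */
    static double summedFieldScore(List<FieldStats> fields, double idf) {
        double score = 0;
        for (FieldStats f : fields) {
            score += f.weight * idf * saturate(f.normalizedTf());
        }
        return score;
    }

    public static void main(String[] args) {
        // The term occurs once in each of two average-length fields.
        FieldStats field = new FieldStats(1, 10, 10, 1);
        List<FieldStats> fields = Arrays.asList(field, field);
        System.out.println(combinedTfScore(fields, 1.0));
        System.out.println(summedFieldScore(fields, 1.0));
    }
}
```

With the term occurring once in each of two average-length fields and idf = 1, 
the first reading gives about 1.375 and the second about 2.0. The gap comes 
from where the saturation is applied: once over the combined tf in the first 
case, and separately per field in the second.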

If I understand correctly, the current scoring algorithm takes into account 
only the fields explicitly specified in the query. Is that right? If so, I see 
no reason why BM25 should behave otherwise. Which of course also means that we 
probably won't be able to save the summed doc length and idf.
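
A small sketch of why such a sum is hard to store ahead of time, assuming the 
"document length" is taken over the query's fields only. The class name, the 
map of per-field lengths and the field names are hypothetical and only serve 
the illustration; this is not how Lucene stores norms.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch only: document length restricted to the fields a query names. */
final class QueryTimeDocLength {
    /**
     * Sums the lengths of the fields the query refers to. Fields that the
     * document does not have contribute nothing.
     */
    static int docLength(Map<String, Integer> fieldLengths, Set<String> queryFields) {
        int total = 0;
        for (String field : queryFields) {
            total += fieldLengths.getOrDefault(field, 0);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical per-field lengths of a single document.
        Map<String, Integer> doc = new HashMap<>();
        doc.put("title", 5);
        doc.put("body", 200);
        doc.put("anchor", 12);

        Set<String> titleOnly = new HashSet<>(Arrays.asList("title"));
        Set<String> titleAndBody = new HashSet<>(Arrays.asList("title", "body"));

        // The same document has a different "length" for each query.
        System.out.println(docLength(doc, titleOnly));
        System.out.println(docLength(doc, titleAndBody));
    }
}
```

Because the result depends on which fields the query names, one value per 
document cannot cover every query; under this assumption the sum would have to 
be computed at query time from the per-field lengths.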

Robert, would you be so kind as to have a look at my proposal? It can be found 
at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1
and is basically the same as what I sent to the mailing list. I wrote that I 
want to implement BM25, BM25F and DFR ("the framework", by which I meant the 
framework with one or two smoothing models), as well as to convert the 
original scoring to the new framework. In light of the thread here, I guess it 
would be better to modify these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?

As for the last item, it applies only if I continue / join the work in 2392. 
Since I guess nobody wants two ranking frameworks, of course I will, but in 
that case should this part of the proposal concentrate just on the higher 
level API?

Thanks!

> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, 
> proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the
> architecture is tailored specifically to VSM, which makes the addition of new
> ranking functions a non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to
> implement a query architecture with pluggable ranking functions.

