[jira] [Commented] (LUCENE-5847) Improved implementation of language models in lucene

Ryan Ernst (JIRA) Wed, 30 Jul 2014 08:37:57 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079387#comment-14079387
 ]


Ryan Ernst commented on LUCENE-5847:
------------------------------------

Why can't the background score be implemented in the specific scorers for 
Dirichlet or JM? I don't think the Scorer interface should be cluttered with 
something specific to one implementation.
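
As a rough illustration of what keeping this inside the Dirichlet-specific code might look like, here is a minimal sketch. It is not Lucene code: the DocScorer interface, the DirichletDocScorer class and the array-based inputs are all hypothetical names made up for this example. The only point is that the background term can be folded into the implementation's own score() without adding a method to the shared interface.

```java
final class DirichletScoringSketch {

    /** Hypothetical shared contract: deliberately has no background-score method. */
    interface DocScorer {
        double score(double[] termFreqs, double docLen);
    }

    /** Hypothetical Dirichlet-specific scorer that handles absent query terms itself. */
    static final class DirichletDocScorer implements DocScorer {
        private final double mu;                // Dirichlet prior (assumed value supplied by caller)
        private final double[] collectionProbs; // p(w|C) for each query term (assumed precomputed)

        DirichletDocScorer(double mu, double[] collectionProbs) {
            this.mu = mu;
            this.collectionProbs = collectionProbs;
        }

        @Override
        public double score(double[] termFreqs, double docLen) {
            // Document-dependent background component: log(mu / (|d| + mu)).
            double logAlpha = Math.log(mu / (docLen + mu));
            double sum = 0.0;
            for (int i = 0; i < termFreqs.length; i++) {
                if (termFreqs[i] > 0) {
                    // Matching term: smoothed, normalized contribution.
                    sum += Math.log(1.0 + termFreqs[i] / (mu * collectionProbs[i]));
                }
                // Every query term, present or absent, gets the background component.
                sum += logAlpha;
            }
            return sum;
        }
    }
}
```

Whether something like this is practical depends on how a scorer gets to see query terms that are absent from the document, which this sketch sidesteps by passing all term frequencies in at once.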

> Improved implementation of language models in lucene 
> -----------------------------------------------------
>
>                 Key: LUCENE-5847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5847
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Hadas Raviv
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-2507.patch
>
>
> The current implementation of language models in Lucene is based on the paper 
> "A Study of Smoothing Methods for Language Models Applied to Ad Hoc 
> Information Retrieval" by Zhai and Lafferty ('01). Specifically, 
> LMDirichletSimilarity and LMJelinekMercerSimilarity use a normalized smoothed 
> score for a matching term in a document, as suggested in that paper.
> However, Lucene doesn't assign a score to query terms that do not appear in a 
> matched document. According to the "pure" LM approach, these terms should be 
> assigned a collection-probability "background score". With Jelinek-Mercer 
> smoothing, the final result list produced by Lucene is rank equivalent to the 
> one a full LM implementation would produce. However, this is not the case 
> for Dirichlet smoothing, because the background score is document dependent. 
> Documents in which not all query terms appear are missing the 
> document-dependent background score for the missing terms. This component 
> affects the final ranking of documents in the list.
> Since LM is a baseline method in many works in the IR research field, I 
> attach a patch that implements a full LM in Lucene. The basic issue that 
> should be addressed here is assigning a document a score that depends on 
> *all* the query terms, collection statistics and the document length. The 
> general idea of what I did is adding a new getBackGroundScore(int docID) 
> method to similarity, scorer and bulkScorer. Then, when a collector assigns a 
> score to a document (score = scorer.score()), I added the background score 
> (score=scorer.score()+scorer.background(doc)) that is assigned by the 
> similarity class used for ranking. 
> The patch also includes a correction of the document length so that it is 
> the real document length and not the encoded one. This is required for the 
> full LM implementation.
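
To make the ranking effect described in the quoted issue concrete, the following self-contained sketch scores two made-up documents under Dirichlet smoothing, once with and once without the background component for absent query terms. Everything here is an assumption for illustration: the class and method names, mu = 2000, the collection probabilities and the term counts. It does not use Lucene and ignores any clamping or length encoding Lucene applies.

```java
final class DirichletRankingDemo {

    // Dirichlet prior; 2000 is an assumed value chosen for this example.
    static final double MU = 2000.0;

    /**
     * Sums normalized Dirichlet scores over the query terms.
     * Matching terms contribute log(1 + tf / (MU * pC)) + log(MU / (docLen + MU)).
     * Absent terms contribute log(MU / (docLen + MU)) only when includeBackground is true.
     */
    static double score(double[] tfs, double[] pC, double docLen, boolean includeBackground) {
        double logAlpha = Math.log(MU / (docLen + MU));
        double sum = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            if (tfs[i] > 0) {
                sum += Math.log(1.0 + tfs[i] / (MU * pC[i])) + logAlpha;
            } else if (includeBackground) {
                sum += logAlpha;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] pC = {0.001, 0.001};   // made-up collection probabilities for two query terms

        double[] tfA = {10, 0};         // document A: long, matches only the first term
        double lenA = 10000;
        double[] tfB = {1, 1};          // document B: shorter, matches both terms
        double lenB = 2000;

        System.out.printf("matched terms only: A=%.4f B=%.4f%n",
                score(tfA, pC, lenA, false), score(tfB, pC, lenB, false));
        System.out.printf("with background:    A=%.4f B=%.4f%n",
                score(tfA, pC, lenA, true), score(tfB, pC, lenB, true));
    }
}
```

With these particular numbers, A outranks B when only matching terms are scored, and B outranks A once the background component is added for A's missing term. With Jelinek-Mercer smoothing the corresponding component does not depend on the document, which is why the issue reports no ranking change in that case.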



