[ https://issues.apache.org/jira/browse/LUCENE-5847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079387#comment-14079387 ]
Ryan Ernst commented on LUCENE-5847:
------------------------------------
Why can't the background score be implemented in the specific scorers for
Dirichlet or JM? I don't think the Scorer interface should be cluttered with
something specific to one implementation.
> Improved implementation of language models in lucene
> -----------------------------------------------------
>
> Key: LUCENE-5847
> URL: https://issues.apache.org/jira/browse/LUCENE-5847
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Hadas Raviv
> Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-2507.patch
>
>
> The current implementation of language models in Lucene is based on the paper
> "A Study of Smoothing Methods for Language Models Applied to Ad Hoc
> Information Retrieval" by Zhai and Lafferty ('01). Specifically,
> LMDirichletSimilarity and LMJelinekMercerSimilarity use a normalized smoothed
> score for a matching term in a document, as suggested in that paper.
> However, Lucene doesn't assign a score to query terms that do not appear in a
> matched document. According to the "pure" LM approach, these terms should be
> assigned a collection-probability "background score". With the
> Jelinek-Mercer smoothing method, the final result list produced by Lucene is
> rank-equivalent to the one that a full LM implementation would produce.
> However, this is not the case for the Dirichlet smoothing method,
> because its background score is document dependent. Documents in which not
> all query terms appear are missing the document-dependent background score
> for the missing terms, and this component affects the final ranking of
> documents in the list.
> Since LM is a baseline method in many works in the IR research field, I
> attach a patch that implements a full LM in Lucene. The basic issue that
> should be addressed here is assigning a document a score that depends on
> *all* the query terms, collection statistics and the document length. The
> general idea of what I did is adding a new getBackGroundScore(int docID)
> method to similarity, scorer and bulkScorer. Then, when a collector assigns a
> score to a document (score = scorer.score()), I add the background score
> (score = scorer.score() + scorer.background(doc)) that is assigned by the
> similarity class used for ranking.
> The patch also includes a correction of the document length such that it will
> be the real document length and not the encoded one. It is required for the
> full LM implementation.
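
To make the point about document dependence concrete, here is a minimal
standalone sketch of the two textbook smoothing formulas. It is not the code
from the attached patch and does not use any Lucene API; the class name,
method names, and the parameter values (mu = 2000, lambda = 0.1, collection
probability 0.001) are assumptions chosen for illustration. Here lambda is
taken to be the weight on the collection model. For a query term that is
absent from the document (tf = 0), Dirichlet smoothing gives
mu * p(w|C) / (|d| + mu), which changes with the document length |d|, while
Jelinek-Mercer gives lambda * p(w|C), which does not.

```java
public final class DirichletBackgroundSketch {
    private DirichletBackgroundSketch() {}

    /** log p(w|d) with Dirichlet smoothing: (tf + mu * p(w|C)) / (|d| + mu). */
    static double dirichletLogProb(long tf, long docLen, double collectionProb, double mu) {
        return Math.log((tf + mu * collectionProb) / (docLen + mu));
    }

    /**
     * log p(w|d) with Jelinek-Mercer smoothing:
     * (1 - lambda) * tf / |d| + lambda * p(w|C).
     * Assumption: lambda is the weight on the collection model.
     */
    static double jelinekMercerLogProb(long tf, long docLen, double collectionProb, double lambda) {
        double maxLikelihood = docLen == 0 ? 0.0 : (double) tf / docLen;
        return Math.log((1.0 - lambda) * maxLikelihood + lambda * collectionProb);
    }

    public static void main(String[] args) {
        double collectionProb = 0.001; // hypothetical p(w|C)
        double mu = 2000.0;            // hypothetical Dirichlet parameter
        double lambda = 0.1;           // hypothetical Jelinek-Mercer parameter

        // Score of a query term that does NOT occur in the document (tf = 0),
        // for a short and a long document.
        for (long docLen : new long[] {50, 5000}) {
            System.out.println("docLen=" + docLen
                + " dirichlet=" + dirichletLogProb(0, docLen, collectionProb, mu)
                + " jelinekMercer=" + jelinekMercerLogProb(0, docLen, collectionProb, lambda));
        }
    }
}
```

Under these assumptions the Jelinek-Mercer value for an unseen term is the
same for every document, so leaving it out shifts all scores by a constant and
preserves the ranking. The Dirichlet value gets smaller as the document gets
longer, so leaving it out can change the order of documents that miss
different numbers of query terms.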