I am implementing a language-modelling similarity function and am using the LMDirichletSimilarity class (and its helper classes) as a template. However, the LMDirichletSimilarity implementation does not appear to match the formula presented in "A Study of Smoothing Methods for Language Models Applied to Information Retrieval" by Zhai and Lafferty.
The score method in LMDirichletSimilarity for matching terms is implemented as follows:

    score = (float) (Math.log(1 + freq / (mu * ((LMStats) stats).getCollectionProbability())) + Math.log(mu / (docLen + mu)))

In particular, the score method only applies the normalisation factor (i.e. the Math.log(mu / (docLen + mu)) part) to matching terms. In the paper's formula, this normalisation is applied once for every term in the query, regardless of whether the term occurs in the document. As a result, a document that matches only some of the query terms receives less length normalisation than it should.

Unless I am missing some background scoring that goes on in Lucene, the Math.log(mu / (docLen + mu)) term should be removed from the per-term score, and the following document-specific term should be added to the document score after the term-scoring part:

    + queryLen * Math.log(mu / (docLen + mu))

Therefore, my question is: where in Lucene can I add a document-specific factor just before the final document scores are sorted? I want this to be calculated and tuneable at query time (not index time).

The boosting features of Lucene seem inflexible here, as they assume that you want to multiply by the boosting factor rather than add a term. I could run the initial query and then re-score the documents in the TopDocs by adding the factor, but it seems there should be a more efficient way to do this.

As this is one of the main formulas in information retrieval, it would be nice if it were implemented correctly. Any help appreciated...
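
To make the discrepancy concrete, here is a minimal standalone sketch in plain Java that uses no Lucene classes. The class and method names (DirichletScoreSketch, perMatchScore, perQueryTermScore) are hypothetical names of my own, and I assume the query is a simple array of terms whose length is queryLen. It compares the normalisation applied per matching term, as in the expression quoted above, with the normalisation applied once per query term, as I read the paper:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Lucene.
final class DirichletScoreSketch {

    // Term-specific part: log(1 + tf / (mu * p(w|C))).
    static double termPart(double tf, double mu, double collectionProb) {
        return Math.log(1 + tf / (mu * collectionProb));
    }

    // Document-length normalisation: log(mu / (docLen + mu)).
    static double lengthNorm(double docLen, double mu) {
        return Math.log(mu / (docLen + mu));
    }

    // Normalisation added once per MATCHING term (the behaviour described above).
    static double perMatchScore(String[] query, Map<String, Integer> docTf,
                                Map<String, Double> collProb, double docLen, double mu) {
        double score = 0.0;
        for (String term : query) {
            Integer tf = docTf.get(term);
            if (tf != null && tf > 0) {
                score += termPart(tf, mu, collProb.get(term)) + lengthNorm(docLen, mu);
            }
        }
        return score;
    }

    // Normalisation added once per QUERY term: queryLen * log(mu / (docLen + mu)).
    static double perQueryTermScore(String[] query, Map<String, Integer> docTf,
                                    Map<String, Double> collProb, double docLen, double mu) {
        double score = query.length * lengthNorm(docLen, mu);
        for (String term : query) {
            Integer tf = docTf.get(term);
            if (tf != null && tf > 0) {
                score += termPart(tf, mu, collProb.get(term));
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Made-up statistics: the document contains "a" but not "b".
        String[] query = {"a", "b"};
        Map<String, Integer> docTf = new HashMap<>();
        docTf.put("a", 2);
        Map<String, Double> collProb = new HashMap<>();
        collProb.put("a", 0.01);
        collProb.put("b", 0.02);

        System.out.println("per-match normalisation:      "
                + perMatchScore(query, docTf, collProb, 100, 2000));
        System.out.println("per-query-term normalisation: "
                + perQueryTermScore(query, docTf, collProb, 100, 2000));
    }
}
```

The two methods agree when every query term occurs in the document. Otherwise they differ by (queryLen - number of matching terms) * Math.log(mu / (docLen + mu)), which depends on docLen and can therefore change the ranking.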