I am curious to know whether the Lucene's scoring algorithm was updated in the latest 1.3 version.
I find the following scoring algorithm in the Similarity class of JAVA API documents. This method is different from the one shown in official FAQ. Could you tell me which one is being used in 1.3? If the algorithm was updated, please send me the formula. I will appreciate that. Thanks, Chong-Ki The score of query q for document d is defined in terms of these methods as follows: score(q,d) = Σ <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi larity.html#tf(int)> tf(t in d) * <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi larity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)> idf(t) * <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Fi eld.html#getBoost()> getBoost(t.field in d) * <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi larity.html#lengthNorm(java.lang.String, int)> lengthNorm(t.field in d) * <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi larity.html#coord(int, int)> coord(q,d) * <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi larity.html#queryNorm(float)> queryNorm(q) t in q http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simil arity.html For the official FAQ, Lucene's scoring algorithm is shown as, 31. How does Lucene assigns scores to hits ? Here is a quote from Doug himself (posted on July 2001 to the Lucene users mailing list): For the record, Lucene's scoring algorithm is, roughly: score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) where: score_d : score for document d sum_t : sum for all terms t tf_q : the square root of the frequency of t in the query tf_d : the square root of the frequency of t in d idf_t : log(numDocs/docFreq_t+1) + 1.0 numDocs : number of documents in index docFreq_t : number of documents containing t norm_q : sqrt(sum_t((tf_q*idf_t)^2)) norm_d_t : square root of number of tokens in d in the same field as t (I hope that's right!) [Doug later added...] Make that: score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d where boost_t : the user-specified boost for term t coord_q_d : number of terms in both query and document / number of terms in query The coordination factor gives an AND-like boost to documents that contain, e.g., all three terms in a three word query over those that contain just two of the words. http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.se arch <http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.s earch&toc=faq#q31> &toc=faq#q31
