Re: Lucene 1.2 - scoring formula needed

Joaquin Delgado Sun, 10 Sep 2006 09:35:47 -0700

What do you mean by mathematically correct? Is there something incorrect in the book?

According to a message posted some time ago at http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/[EMAIL PROTECTED], where people first noticed a change in the scoring algorithm, the official FAQ (for 1.2) had posted, from Doug himself, the following formula:

score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d


where

   * score (q,d) : score for document d given query q
   * sum_t : sum for all terms t in q
   * tf_q : the square root of the frequency of t in q
   * tf_d : the square root of the frequency of t in d
   * idf_t : log(numDocs/(docFreq_t+1)) + 1.0
   * numDocs : number of documents in index
   * docFreq_t : number of documents containing t
   * norm_q : sqrt(sum_t((tf_q*idf_t)^2))
   * norm_d_t : square root of number of tokens in d in the same field
     as t
   * boost_t : the user-specified boost for term t
   * coord_q_d : number of terms in both query and document / number of
     terms in query. The coordination factor gives an AND-like boost to
     documents that contain, e.g., all three terms in a three-word
     query over those that contain just two of the words.
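
As a rough illustration, the formula above can be sketched in Python as follows. This is not Lucene code: the function and parameter names (score, idf, doc_freqs, boosts) are made up for the example, the document is treated as a single field, and the "+1" in idf_t is assumed to belong to the denominator (docFreq_t + 1).

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    # Assumption: the "+1" is part of the denominator, i.e. docFreq_t + 1.
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query_terms, doc_terms, doc_freqs, num_docs, boosts=None):
    """Score one single-field document against a query (hypothetical helper)."""
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)

    # norm_q: sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum(
        (math.sqrt(n) * idf(num_docs, doc_freqs.get(t, 0))) ** 2
        for t, n in q_counts.items()
    ))
    # norm_d_t: square root of the number of tokens in the field
    norm_d = math.sqrt(len(doc_terms))
    if norm_q == 0 or norm_d == 0:
        return 0.0

    total = 0.0
    matched = 0
    for t, n in q_counts.items():
        if d_counts[t] == 0:
            continue  # a term missing from the document contributes nothing
        matched += 1
        tf_q = math.sqrt(n)
        tf_d = math.sqrt(d_counts[t])
        idf_t = idf(num_docs, doc_freqs.get(t, 0))
        total += (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d) * boosts.get(t, 1.0)

    # coord_q_d: matched query terms / distinct query terms
    coord = matched / len(q_counts)
    return total * coord
```

With a two-term query, a document containing both terms gets coord = 1 while a document containing only one gets coord = 0.5, which is the AND-like boost described in the last bullet.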

This is different from the current scoring algorithm described at http://lucene.apache.org/java/docs/scoring.html#Scoring, which includes field boosting, document length normalization, etc.

In any case, these are variations of the TF-IDF weighted vector space "cosine of the angle" between the document and the query vectors (also known as cosine similarity or normalized dot product; see http://en.wikipedia.org/wiki/Dot_product). This computation treats documents and queries as vectors in an N-dimensional space (N is the number of unique terms excluding stopwords).
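
A minimal sketch of that cosine computation, assuming each document and query has already been turned into a sparse {term: weight} dictionary of TF-IDF weights (the function name cosine_similarity is just for illustration):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)
```

Vectors that share no terms score 0, and vectors pointing in the same direction score 1 regardless of their length, which is why the measure is insensitive to how long the document is.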

In statistical/probabilistic terms, this can also be interpreted geometrically as the correlation between samples drawn from two random variables Q and D (representing a query and a document; see http://en.wikipedia.org/wiki/Correlation), where each data point (TF-IDF weight) is an estimate of how much "information" each term conveys. There are more complex probabilistic ranking algorithms which take advantage of prior knowledge of relevance (pre-ranked documents, for example) in their computation, primarily exploiting Bayes' theorem.
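
To make the link between correlation and the cosine concrete, here is a small sketch assuming the Pearson correlation coefficient is the one meant: it equals the cosine of the angle between the two vectors after each has had its mean subtracted. For raw (uncentered) TF-IDF vectors the correspondence is therefore only an analogy. The function names are illustrative.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def pearson(x, y):
    """Pearson correlation computed as the cosine of the mean-centered vectors.

    Raises ZeroDivisionError if either vector is constant (zero variance).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])
```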

Both the Vector Space Model and the Probabilistic Model are well studied in the Information Retrieval literature. See http://www2.sims.berkeley.edu/courses/is202/f00/lectures/Lecture8_202.ppt for an overview of Ranking and Feedback.




-- Joaquin Delgado




Karl Koch wrote:

Hi,

I am looking for a mathematically correct IR scoring formula for Lucene 1.2. 
The description in the book (Lucene in Action, 2005 edition) is rather 
non-mathematical, and I am not sure whether it applies to Lucene 1.2 or only 
to later versions.

Perhaps Eric or Otis can directly comment on this? Is there any published paper 
on the Lucene scoring algorithm that describes the formula in depth?

Best Regards,
Karl
