Hello Karl,

I’m very interested in the details of Lucene’s scoring as well.

Karl Koch wrote:
For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with
norm_q : sqrt(sum_t((tf_q*idf_t)^2))

which is also called cosine normalisation. This is a technique that is rather
comprehensive and usually used for documents only(!) in all systems I have seen
so far.

I hope I have understood http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm and your problem correctly: "queryNorm(q) is a normalizing factor used to make scores between queries comparable."
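As a sanity check on what that factor does, here is a small sketch (Python rather than Lucene's actual Java code; the input is assumed to be the tf*idf, times boost, weights of the query terms):

```python
import math

def query_norm(term_weights):
    # Lucene-style default queryNorm: 1 / sqrt(sum of squared term weights).
    # Multiplying each term weight by this factor is the same as dividing by
    # the cosine-style norm_q = sqrt(sum_t((tf_q * idf_t)^2)) quoted above.
    sum_of_squared_weights = sum(w * w for w in term_weights)
    return 1.0 / math.sqrt(sum_of_squared_weights)
```

Note that this factor is the same for every document matched by one query, so it never changes the ranking within a query; it only rescales the scores so that different queries become comparable.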

For "normal" searches you don’t need to compare queries; you only have to compare the documents retrieved for a single query. The queries of a "normal" search usually have different semantics, so you can’t really compare the results of different queries.

If you use Lucene for instance for the classification of documents, it is necessary to compare the results of different queries. You have the documents to classify indexed on one side and the classes on the other side (thread "Store a document-like map" http://www.gossamer-threads.com/lists/lucene/java-user/42816). Then you can generate queries from the classes and search against the documents. The score of a matching document is its similarity to the query built from the class. Now the queries have to be comparable.

You can transform a document into a query and a query into a document. That could be the reason for normalizing a query like a document.

For documents, Lucene employs its norm_d_t, which is explained as:

norm_d_t : square root of number of tokens in d in the same field as t

basically just the square root of the number of terms in the document (since I always search over all fields). I would have expected cosine normalisation here...
The paper you provided uses document normalisation in the following way:

norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t.
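One way to see the relationship is to put the two factors side by side numerically. A small sketch (Python, using the 0.8/0.2 pivot constants from the paper; avgDocLength and the term counts are made-up inputs, not anything Lucene computes for you):

```python
import math

def lucene_length_norm(num_terms):
    # Lucene's DefaultSimilarity lengthNorm: 1 / sqrt(number of terms in field)
    return 1.0 / math.sqrt(num_terms)

def pivoted_norm(avg_doc_length, num_unique_terms):
    # The paper's pivoted document normalisation:
    # 1 / sqrt(0.8 * avgDocLength + 0.2 * (# of unique terms in d))
    return 1.0 / math.sqrt(0.8 * avg_doc_length + 0.2 * num_unique_terms)
```

If one feeds the same count into both (a document exactly at the average length), the two coincide; the pivoted form just dampens the length dependence for documents far from the average, whereas Lucene's factor depends on the raw term count alone.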

"norm(t,d)   =   doc.getBoost()  •  lengthNorm(field)  •  ∏ f.getBoost()"
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm)

At first glance that seems to be independent of the document's length. But the factor lengthNorm(field) uses the document's length, or rather the field length: "Computes the normalization value for a field given the total number of terms contained in a field." (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm).

"Implemented as 1/sqrt(numTerms)" (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/DefaultSimilarity.html#lengthNorm(java.lang.String,%20int))
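Putting the quoted pieces together, the whole norm(t,d) factor could be sketched like this (Python pseudocode for the Java formula; doc_boost and field_boosts stand in for doc.getBoost() and the f.getBoost() values):

```python
import math

def length_norm(num_terms_in_field):
    # DefaultSimilarity: "Implemented as 1/sqrt(numTerms)"
    return 1.0 / math.sqrt(num_terms_in_field)

def norm(doc_boost, num_terms_in_field, field_boosts):
    # norm(t,d) = doc.getBoost() * lengthNorm(field) * product of f.getBoost()
    result = doc_boost * length_norm(num_terms_in_field)
    for boost in field_boosts:
        result *= boost
    return result
```

So the document length enters the score only through this per-field 1/sqrt(numTerms) factor, not through a cosine normalisation over the document's term weights.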

Sören

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]