Hello Karl,

I’m very interested in the details of Lucene’s scoring as well.

Karl Koch wrote:
For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with
norm_q : sqrt(sum_t((tf_q*idf_t)^2))

which is also called cosine normalisation. This is a technique that is rather
comprehensive and usually used for documents only(!) in all systems I have seen
so far.

I hope I have understood http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm and your problem correctly: "queryNorm(q) is a normalizing factor used to make scores between queries comparable."
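As a sanity check on what that factor does, here is a small sketch (Python rather than Lucene's actual Java code; the input is assumed to be the tf*idf, times boost, weights of the query terms):

```python
import math

def query_norm(term_weights):
    # Lucene-style default queryNorm: 1 / sqrt(sum of squared term weights).
    # Multiplying each term weight by this factor is the same as dividing by
    # the cosine-style norm_q = sqrt(sum_t((tf_q * idf_t)^2)) quoted above.
    sum_of_squared_weights = sum(w * w for w in term_weights)
    return 1.0 / math.sqrt(sum_of_squared_weights)
```

Note that this factor is the same for every document matched by one query, so it never changes the ranking within a query; it only rescales the scores so that different queries become comparable.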

For "normal" searches you don’t need to compare queries; you only have to compare the documents retrieved for a single query. The queries of a "normal" search usually have different semantics, so you can’t really compare the results of different queries.

If you use Lucene for instance for the classification of documents, it is necessary to compare the results of different queries. You have the documents to classify indexed on one side and the classes on the other side (thread "Store a document-like map" http://www.gossamer-threads.com/lists/lucene/java-user/42816). Then you can generate queries from the classes and search against the documents. The score of a matching document is its similarity to the query built from the class. Now the queries have to be comparable.

You can transform a document into a query and a query into a document. That could be the reason for normalizing a query like a document.

For documents, Lucene employs its norm_d_t, which is explained as:

norm_d_t : square root of number of tokens in d in the same field as t

basically just the square root of the number of terms in the document (since I always search over all fields). I would have expected cosine normalisation here...
The paper you provided uses document normalisation in the following way:

norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t.
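One way to see the relationship is to put the two factors side by side numerically. A small sketch (Python, using the 0.8/0.2 pivot constants from the paper; avgDocLength and the term counts are made-up inputs, not anything Lucene computes for you):

```python
import math

def lucene_length_norm(num_terms):
    # Lucene's DefaultSimilarity lengthNorm: 1 / sqrt(number of terms in field)
    return 1.0 / math.sqrt(num_terms)

def pivoted_norm(avg_doc_length, num_unique_terms):
    # The paper's pivoted document normalisation:
    # 1 / sqrt(0.8 * avgDocLength + 0.2 * (# of unique terms in d))
    return 1.0 / math.sqrt(0.8 * avg_doc_length + 0.2 * num_unique_terms)
```

If one feeds the same count into both (a document exactly at the average length), the two coincide; the pivoted form just dampens the length dependence for documents far from the average, whereas Lucene's factor depends on the raw term count alone.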

"norm(t,d)   =   doc.getBoost()  •  lengthNorm(field)  •  ∏ f.getBoost()"
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm)

At first glance that seems to be independent of the document's length. But the factor lengthNorm(field) uses the document's length, or rather the field length: "Computes the normalization value for a field given the total number of terms contained in a field." (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm).

"Implemented as 1/sqrt(numTerms)" (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/DefaultSimilarity.html#lengthNorm(java.lang.String,%20int))
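Putting the quoted pieces together, the whole norm(t,d) factor could be sketched like this (Python pseudocode for the Java formula; doc_boost and field_boosts stand in for doc.getBoost() and the f.getBoost() values):

```python
import math

def length_norm(num_terms_in_field):
    # DefaultSimilarity: "Implemented as 1/sqrt(numTerms)"
    return 1.0 / math.sqrt(num_terms_in_field)

def norm(doc_boost, num_terms_in_field, field_boosts):
    # norm(t,d) = doc.getBoost() * lengthNorm(field) * product of f.getBoost()
    result = doc_boost * length_norm(num_terms_in_field)
    for boost in field_boosts:
        result *= boost
    return result
```

So the document length enters the score only through this per-field 1/sqrt(numTerms) factor, not through a cosine normalisation over the document's term weights.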

Sören

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]