Hello,
My index is, I guess, a little bit unorthodox: out of 75 000 entries, 55 000 come from a single source and the remaining 20 000 from 15 other sources. When executing a query, though, I would like all the sources to start on an equal footing, and that does not seem to be the case: the documents coming from the "big" source are always last in the result lists.

I then checked how the score is calculated:

score(q,d) = coord(q,d) * queryNorm(q) * SUM( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t, d) )

where q is a query, t is a term and d a document. I won't go into all the details; for those, check the following link:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/search/Similarity.html

Using the excellent Luke application, written by Andrzej Bialecki, I understood that my problem is due to the last element of the equation: norm(t, d).

Using Luke, I checked the detailed explanation of a document after executing a simple query on the index, something like content:"anExpression". The score was equal to tf( ) * idf( ) * fieldNorm( ), and in there fieldNorm = 0.0001, completely killing the score. This fieldNorm( ) value is, I guess, equal to the t.getBoost( ) * norm( ) we see in the main equation?

When checking the details of the document fields for all the documents in Luke, I could read that the boost field was amazingly low, at 0.00419... Why is that? What is this boost value exactly? The document boost?

Note: based on the Lucene documentation, norm( ) encapsulates the document boost, the field boost and the lengthNorm( ) value. This lengthNorm( ) value is computed when the document is added to the index, according to the number of tokens of this field in this document. It should, in theory, have no impact here.

I went a little bit further, creating 3 indexes for 3 different sources (one index per source).
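
For reference, the norm( ) described in the note above can be sketched as follows. This is only a minimal sketch, assuming Lucene's default similarity, where lengthNorm is 1/sqrt(number of tokens in the field); the function and parameter names are made up for illustration, and the lossy one-byte encoding Lucene applies to stored norms is ignored.

```python
import math

def length_norm(num_terms):
    # Assumed default-similarity behaviour: longer fields get a smaller norm.
    return 1.0 / math.sqrt(num_terms)

def field_norm(doc_boost, field_boost, num_terms):
    # norm(t, d) as described in the note: document boost, field boost
    # and lengthNorm multiplied together at indexing time.
    return doc_boost * field_boost * length_norm(num_terms)

def term_score(tf, idf, query_boost, norm):
    # One term's contribution inside the SUM of the scoring formula.
    return tf * idf ** 2 * query_boost * norm

# Hypothetical 100-token field: with neutral boosts the norm is 0.1,
# while a document boost of 0.004193 scales it down by the same factor.
neutral = field_norm(1.0, 1.0, 100)
boosted = field_norm(0.004193, 1.0, 100)
print(neutral, boosted)
```

Under these assumptions, a tiny document boost alone is enough to drag the norm, and therefore the whole score, down by the same factor.
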
I got the following metrics in Luke (within each index, the values are the same for all documents):

Index size    boost value    fieldNorm( ) value
190           0.08222        0.0024
789           0.067          0.0020
6838          0.012          0.0004
50 000        0.004193       0.0001

As we can see, there is a clear correlation between the size of the index and the boost value associated with its documents. Note that within a single source, the size and URL patterns of all the referenced documents are alike.

There is indeed something strange. If we agree that the fieldNorm( ) value is a function of the boost value, then I have a problem with the boost. The boost value displayed in Luke should be the same everywhere! Unless the document boost or the field boost is, somehow, linked to the size of the index... or the size of a segment? Or the URL pattern?

Basically my question is: how is the boost value that Luke displays as a field for every indexed document calculated?

Thank you and have a good weekend,
David
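
P.S. A quick sanity check on the metrics reported above: dividing each fieldNorm( ) value by the corresponding boost value shows how closely the two move together. This is only a rough sketch; the variable names are made up, and since the fieldNorm( ) values are rounded to four decimals, the ratios are approximate.

```python
# (index size, boost value, fieldNorm value) as read from Luke.
metrics = [
    (190, 0.08222, 0.0024),
    (789, 0.067, 0.0020),
    (6838, 0.012, 0.0004),
    (50000, 0.004193, 0.0001),
]

# If fieldNorm is roughly boost * (some length factor), this ratio should
# stay in the same range while the boost itself shrinks about 20-fold.
ratios = [(size, norm / boost) for size, boost, norm in metrics]

for size, ratio in ratios:
    print(f"index size {size}: fieldNorm / boost = {ratio:.4f}")
```

All four ratios land between roughly 0.02 and 0.04, which is consistent with fieldNorm( ) being driven mostly by the boost value.
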
