Hi, my comments are inline.
Chris Hostetter wrote:
: I would like to override the Similarity class lengthNorm(String
: fieldName, int numTerms) so that it behaves similar to queryNorm(float
: sumOfSquaredWeights). So the method signature becomes lengthNorm(String
: fieldName, float sumOfSquaredWeights) where sumOfSquaredWeights = sum of
: the squares of doc term weights.
:
: Looking at the way sumOfSquaredWeights was used in
: org.apache.lucene.search.Query weight method, I would like to have a
: weight method in org.apache.lucene.document.Field (or may be in
: org.apache.lucene.document.Document) which returns the weight based on
: the terms in the Field. Can anyone tell me how to start?

can you explain more what you mean by "doc term weights" ?
It seems like what you are interested in doing is changing the way the norm value of a doc/field is determined so that it's determined not just by the number of terms in the field, but also by the "weight" of some terms -- I'm not sure if you mean the terms being queried on, or the terms stored in the field for the document.
Yes, you got the idea; I mean the terms in the field. I think the term weights of the query are already factored into queryNorm. I want to normalize based on the weights of the field's terms too.
Two concepts that already exist (and may be useful to you) are: 1) the "boosts" associated with Fields and Documents at indexing time, which are combined with the lengthNorm at index time to determine a single "norm" value for the doc/field pair.
I don't think this is what I want, because the lengthNorm is still based on the number of terms.
2) the idf of the terms being queried on, which is multiplied by the field norm as part of the query time scoring (you can see it in the fieldWeight in a score Explanation)
Yes, I noticed this, but it is not what I want because it uses the "idf of the terms being queried". What I want is for fieldWeight to use 1/sqrt(sumOfSquaredWeights), where sumOfSquaredWeights = the sum of tf^2 over all terms in the field.
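A minimal sketch of the normalization described above, assuming the per-term frequencies of the field are already available as a map. The class TfLengthNorm and its norm method are hypothetical names used only for illustration; they are not part of Lucene, and the sketch does not show how such a value would be wired into Similarity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: computes 1/sqrt(sum of tf^2) for one field.
class TfLengthNorm {

    // termFreqs maps each distinct term in the field to its raw frequency (tf).
    static float norm(Map<String, Integer> termFreqs) {
        double sumOfSquaredWeights = 0.0;
        for (int tf : termFreqs.values()) {
            sumOfSquaredWeights += (double) tf * tf;
        }
        // An empty field has nothing to normalize by; return 0 instead of dividing by zero.
        if (sumOfSquaredWeights == 0.0) {
            return 0.0f;
        }
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        Map<String, Integer> field = new LinkedHashMap<>();
        field.put("lucene", 3);
        field.put("index", 4);
        // Sum of squares = 9 + 16 = 25, so the norm is 1/5.
        System.out.println(norm(field));
    }
}
```

Unlike a norm based on numTerms, this value depends on how the occurrences are distributed: a field with one term repeated five times gets a smaller norm than a field with five distinct terms.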
3) I have another issue with the explanation, which seems to be a bug. Below, I've given a printout of the explanation. Something strange happens when I use my own Similarity: it prints out all query terms even though some of them do not appear in the doc (see "formulation": docFreq = 0, but it still appears in the explanation).
Also, the scores don't tally. I printed out the raw score for doc 21 using the HitCollector and it returns 1.4241531. When I print out the explanation, the score is 2.731636. Shouldn't these be the same, since neither is a normalized score?
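As a sanity check on the printout below, the per-term factors can be multiplied out by hand. This sketch copies the values for "formulation" from the explanation; the class ExplanationCheck and its method are hypothetical names for illustration, not Lucene API. It only reproduces the arithmetic shown in the printout and does not explain the difference between the two totals.

```java
// Hypothetical arithmetic check; the inputs are copied from the explanation printout.
class ExplanationCheck {

    // weight = queryWeight * fieldWeight, where
    // queryWeight = idf * queryNorm and fieldWeight = tf * idf * fieldNorm.
    static double termWeight(double tf, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double idf = 5.9687076;
        double queryNorm = 0.06848182;
        double fieldNorm = 0.125;

        // With tf reported as 1.0 (even though termFreq=0), the term contributes to the sum.
        System.out.println(termWeight(1.0, idf, queryNorm, fieldNorm));

        // With a tf of 0.0 for the same term, the product vanishes.
        System.out.println(termWeight(0.0, idf, queryNorm, fieldNorm));
    }
}
```

So the absent term adds to the explanation total only because its tf factor is 1.0 rather than 0 for termFreq=0.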
------ Explanation --------
doc id:21 score = 1.4241531
Explanation =
2.731636 = sum of:
  ......
  0.30496213 = weight(Contents:formulation in 21), product of:
    0.40874794 = queryWeight(Contents:formulation), product of:
      5.9687076 = idf(docFreq=0)
      0.06848182 = queryNorm
    0.74608845 = fieldWeight(Contents:formulation in 21), product of:
      1.0 = tf(termFreq(Contents:formulation)=0)
      5.9687076 = idf(docFreq=0)
      0.125 = fieldNorm(field=Contents, doc=21)
  ......
------ End of Explanation --------

Thanks.
--
Eugene