Hi,
My comments in-line.
Chris Hostetter wrote:
: I would like to override the Similarity class lengthNorm(String
: fieldName, int numTerms) so that it behaves similar to queryNorm(float
: sumOfSquaredWeights). So the method signature becomes lengthNorm(String
: fieldName, float sumOfSquaredWeights) where sumOfSquaredWeights = sum of
: the squares of doc term weights.
:
: Looking at the way sumOfSquaredWeights was used in
: org.apache.lucene.search.Query weight method, I would like to have a
: weight method in org.apache.lucene.document.Field (or may be in
: org.apache.lucene.document.Document) which returns the weight based on
: the terms in the Field. Can anyone tell me how to start?
can you explain more what you mean by "doc term weights" ?
It seems like what you are interested in doing is changing the way norm
value of a doc/field is determined so that it's determined not just by the
number of terms in the field, but also by the "weight" or some terms --
i'm not sure if you mean the terms being queried on, or the terms stored
in the field for the document
Yes, you got the idea, i mean the terms in the field. I think term
weights of the query are already factored in in queryNorm. I want to
normalize based on the field's terms' weights too.
Two concepts that already exist (and may be useful to you) are:
1) the "boosts" associated with Fields and Documents at indexing time,
which are combined with the lengthNorm at index time to determine a single
"norm" value for the doc/field pair.
I don;t think this is what I want because the lengthNorm is still using
the # of terms.
2) the idf of the terms being queried on, which is multiplied by the field
norm as part of the query time scoring (you can see it in the
fieldWeight in a score Explanation)
Yes, I noticed this, but this is not what I want because its using "idf
of the terms being queried". What I want fieldWeight to be is to use the
1/ sqrt(sumOfSquaredWeights), where sumOfSquaredWeights = tf^2 over
all terms in the field.
3) I got another issue with the explanation, which seems to be a bug.
Below, I;ve given a printout of the explanation. There's something
strange when I use my own Similarity it prints out all query terms
despite some them not appearing in the doc (See for "formulation" the
docFreq = 0 but it appears in the explanation).
Also the scores don;t tally. I printed out the raw score for doc 21
using the HitCollector and it returns 1.4241531. I printout explanation
the score is 2.731636. Shouldn't this be the same since both aren't
normalized scores?
------ Explanation --------
doc id:21 score = 1.4241531
Explanation = 2.731636 = sum of:
......
0.30496213 = weight(Contents:formulation in 21), product of:
0.40874794 = queryWeight(Contents:formulation), product of:
5.9687076 = idf(docFreq=0)
0.06848182 = queryNorm
0.74608845 = fieldWeight(Contents:formulation in 21), product of:
1.0 = tf(termFreq(Contents:formulation)=0)
5.9687076 = idf(docFreq=0)
0.125 = fieldNorm(field=Contents, doc=21)
......
------ End of Explanation --------
Thanks.
--
Eugene
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]