Remember: we're not really doing cosine at all here.  The factor of IDF^2 on
top and the factor of 1/sqrt(numTermsInDocument) on the bottom combine to
produce the following effect:

q1 = "TERM1"
q2 = "TERM2"

doc1 = "TERM1"
doc2 = "TERM2"

score(q1, doc1) = idf(TERM1)
score(q2, doc2) = idf(TERM2)

Both are perfect matches, but one scores higher (possibly much higher) than
the other.
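
To make the arithmetic above concrete, here is a minimal sketch in Python (not Lucene code) of the default scoring formula for a one-term query against a one-term document. It assumes the DefaultSimilarity definitions idf = 1 + ln(numDocs / (docFreq + 1)), tf = sqrt(freq), lengthNorm = 1/sqrt(numTerms), and queryNorm = 1/sqrt(sumOfSquaredWeights), with coord and all boosts equal to 1, and it ignores the lossy one-byte encoding of norms. The corpus size and document frequencies are made up for illustration.

```python
import math

def idf(doc_freq, num_docs):
    # DefaultSimilarity-style idf: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score_single_term(doc_freq, num_docs, term_freq=1, num_terms_in_doc=1):
    """Score of a one-term query against a document containing that term.

    coord and boosts are assumed to be 1; norm encoding loss is ignored.
    """
    term_idf = idf(doc_freq, num_docs)
    # queryNorm = 1 / sqrt(sum of squared query weights); one term here.
    query_norm = 1.0 / math.sqrt(term_idf ** 2)
    tf = math.sqrt(term_freq)
    length_norm = 1.0 / math.sqrt(num_terms_in_doc)
    # The idf^2 on top meets only a single idf (via queryNorm) on the bottom.
    return query_norm * tf * term_idf ** 2 * length_norm

# Hypothetical corpus: TERM1 is rare, TERM2 is common.
NUM_DOCS = 1000
score_q1_doc1 = score_single_term(doc_freq=1, num_docs=NUM_DOCS)
score_q2_doc2 = score_single_term(doc_freq=500, num_docs=NUM_DOCS)

print(score_q1_doc1, idf(1, NUM_DOCS))
print(score_q2_doc2, idf(500, NUM_DOCS))
```

Under those assumptions each score collapses to the idf of its term, so the perfect match on the rare term outscores the perfect match on the common one.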

Boosts work just fine with cosine (it's just a way of putting "tf" into the
query side as well as in the document side), but normalizing documents without
taking the idf of terms in the document into consideration blows away the
ability to compare scores in default Lucene scoring, even *with* the
queryNorm() factored in.
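
For contrast, here is a rough sketch in Python (again, not Lucene code) of what a true cosine would give if the document vector were normalized by its idf-weighted length instead of by term count alone. The idf values are invented for illustration, and `cosine` is a hypothetical helper, not a Lucene API.

```python
import math

def cosine(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc_weights.get(term, 0.0)
              for term, w in query_weights.items())
    query_len = math.sqrt(sum(w * w for w in query_weights.values()))
    doc_len = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (query_len * doc_len)

# Made-up idf values: TERM1 is rare, TERM2 is common.
IDF = {"TERM1": 7.2, "TERM2": 1.7}

# tf = 1 everywhere, so each weight is simply the term's idf.
q1 = {"TERM1": IDF["TERM1"]}
doc1 = {"TERM1": IDF["TERM1"]}
q2 = {"TERM2": IDF["TERM2"]}
doc2 = {"TERM2": IDF["TERM2"]}

cos_q1_doc1 = cosine(q1, doc1)
cos_q2_doc2 = cosine(q2, doc2)
print(cos_q1_doc1, cos_q2_doc2)
```

With idf included in the document length, both perfect matches come out at (approximately, up to floating point) 1, which is the cross-query comparability that the default normalization gives up.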

I know you probably know this, Mark, but for the sake of people reading this
who haven't worked through the scoring in detail, it's important to state
that in Lucene as currently structured, scores can be *wildly* different
between queries, even with queryNorm() factored in.

  -jake


On Fri, Nov 20, 2009 at 2:24 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Grant Ingersoll wrote:
> >
> >  What I would like to get at is why anyone thinks scores are
> > comparable across queries to begin with.
> >
> They are somewhat comparable because we are using the approximate cosine
> between the document/query vectors for the score - plus boosts and stuff.
> How close the vectors are to each other. If q1 has a smaller angle diff
> with d1 than q2 does with d2, then you can do a comparison. It's just
> vector similarities. It's approximate because we fudge the normalization.
> Why do you think the scores within a query search are comparable? What's
> the difference when you try another query? The query is the difference,
> and the query norm is what makes it more comparable. It's just a
> different query vector with another query. It's still going to just be a
> given "angle" from the doc vectors. Closer is considered a better match.
> We don't do it to improve anything, or because someone discovered
> something - it's just part of the formula for calculating the cosine. It's
> the dot product formula. You can lose it and keep the same relative
> rankings, but then you are further from the cosine for the score - you
> start scaling by the magnitude of the query vector. When you do that
> they are not so comparable.
>
> If you take out the queryNorm, it's much less comparable. You are
> effectively multiplying the cosine by the magnitude of the query vector
> - so different queries will scale the score differently - and not in a
> helpful way - a term vector and query vector can have very different
> magnitudes, but very similar term distributions. That's why we are using
> the cosine rather than Euclidean distance in the first place. Pretty
> sure it's more linear algebra than IR - or the vector stuff from calc 3
> (or wherever else different schools put it).
>