Chuck Williams wrote:
Christoph, thanks for reading through my long postings and sharing your
thoughts. I had one comment in the first proposal email stating a
conclusion to move away from cosine normalization, but I didn't share
the reasons for this conclusion. Please let me know if you agree with
the following analysis.
I believe the central issue is the term sum(t) weight(t,d)^2, as Doug
pointed out. There seem to be two possible definitions for this term:
a) The sum extends over all the terms in the document
b) The sum just extends over the terms in the query
Damn, I was a little bit sloppy. Cosine normalization would of course
require the sum over all terms of a document (a), and Doug is right that
this probably cannot be computed efficiently.
So, cosine normalization looks like a loser to me. I'm not an expert in
this and may have the wrong analysis here. Do you see flaws in the
above?
No
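To make the difference between (a) and (b) concrete, here is a minimal sketch in plain Java. It does not use any Lucene class; the class name, method names and the example weights are made up for illustration only. It just computes sum(t) weight(t,d)^2 both ways for one small document.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only: hypothetical names, not part of Lucene.
public class SumOfSquaresSketch {

    // (a) The sum extends over all the terms in the document.
    static double sumOverDocument(Map<String, Double> docWeights) {
        double sum = 0.0;
        for (double w : docWeights.values()) {
            sum += w * w;
        }
        return sum;
    }

    // (b) The sum extends only over the terms in the query.
    // Query terms that are absent from the document contribute nothing.
    static double sumOverQuery(Map<String, Double> docWeights, Set<String> queryTerms) {
        double sum = 0.0;
        for (String term : queryTerms) {
            Double w = docWeights.get(term);
            if (w != null) {
                sum += w * w;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up weight(t,d) values for a three-term document.
        Map<String, Double> doc = new HashMap<>();
        doc.put("lucene", 2.0);
        doc.put("score", 1.0);
        doc.put("index", 2.0);

        Set<String> query = new HashSet<>();
        query.add("lucene");
        query.add("score");

        System.out.println("(a) over document: " + sumOverDocument(doc));
        System.out.println("(b) over query:    " + sumOverQuery(doc, query));
    }
}
```

Variant (a) has to visit every term of every candidate document, which is the part that looks hard to do efficiently at search time. Variant (b) only touches the query terms, but it is then no longer the cosine normalization.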
I continue to believe this is an important problem and am very
appreciative that some others are digging into the issue. My specific
proposal has the benefit of not changing the score relationships
relative to Lucene today and so is good from a backward-compatibility
standpoint. It is clearly better than the current normalization in
Hits. I think that setting the top score to its (net boost) / (total
boost) is not too bad, although as indicated in the proposal this could
be further refined in an attempt to also use other factors (tf, idf
and/or length norm) in the setting of the top score. I'm not sure
whether or not using these additional factors in the normalization
would be a good thing and would appreciate other thoughts. (Remember
that all factors will be used in the scoring -- the only question is
which are important in setting the normalized top score.)
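As a rough illustration of the "top score = (net boost) / (total boost)" idea, here is a small sketch in plain Java. It is not the code from the proposal and uses no Lucene class; the names are hypothetical, and netBoost and totalBoost are simply taken as given inputs because their exact computation is not spelled out in this mail. The point is only that rescaling every raw score by one common factor fixes the top score while leaving the ratios between scores unchanged.

```java
// Illustration only: hypothetical names, not part of Lucene.
public class TopScoreNormSketch {

    // Rescale raw scores so that the highest one becomes netBoost / totalBoost.
    // All scores are multiplied by the same factor, so their ratios are preserved.
    static float[] normalize(float[] rawScores, float netBoost, float totalBoost) {
        float[] normalized = new float[rawScores.length];

        // Find the top raw score.
        float top = 0f;
        for (float s : rawScores) {
            if (s > top) {
                top = s;
            }
        }

        // Nothing sensible to do without a positive top score and total boost;
        // in that case this sketch returns all zeros.
        if (top <= 0f || totalBoost <= 0f) {
            return normalized;
        }

        float scale = (netBoost / totalBoost) / top;
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] * scale;
        }
        return normalized;
    }

    public static void main(String[] args) {
        float[] raw = {4f, 2f, 1f};
        for (float s : normalize(raw, 3f, 4f)) {
            System.out.println(s);
        }
    }
}
```

Whether tf, idf or the length norm should also influence the target top score is left open here, just as in the text above.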
I don't see any way to address this issue through subclassing -- fixing
it seems to require modifying Lucene source. I'd rather not diverge
from Lucene source, especially in so many fundamental classes, and so
would like to see the changes incorporated back into Lucene. Is that
likely if I make the changes?
As far as the current normalization is concerned, I think you can "switch
it off" by using your own Similarity implementation: for example, make
queryNorm and coord return 1.0. I hope it's that simple :-)
So you should be able to implement your new normalization just by changing
the scorers and IndexSearcher.
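Roughly what I have in mind for "switching off", as a self-contained sketch. BaseSimilarity below is a hypothetical stand-in so that the example compiles on its own; it is NOT Lucene's class, and its formulas are only my understanding of the default behaviour. In real code you would presumably extend Lucene's own similarity class and override the methods named queryNorm and coord; please check the exact signatures against the Lucene version you use.

```java
// Illustration only: stand-in classes, not Lucene's API.
public class NeutralSimilaritySketch {

    // Hypothetical stand-in for the default similarity.
    static class BaseSimilarity {
        // Assumed default: 1 / sqrt(sum of squared weights).
        float queryNorm(float sumOfSquaredWeights) {
            return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        }

        // Assumed default: fraction of query terms found in the document.
        float coord(int overlap, int maxOverlap) {
            return overlap / (float) maxOverlap;
        }
    }

    // Both factors become 1.0, so they no longer influence the score.
    static class NeutralSimilarity extends BaseSimilarity {
        @Override
        float queryNorm(float sumOfSquaredWeights) {
            return 1.0f;
        }

        @Override
        float coord(int overlap, int maxOverlap) {
            return 1.0f;
        }
    }

    public static void main(String[] args) {
        BaseSimilarity base = new BaseSimilarity();
        BaseSimilarity neutral = new NeutralSimilarity();
        System.out.println(base.queryNorm(4f) + " vs " + neutral.queryNorm(4f));
        System.out.println(base.coord(1, 2) + " vs " + neutral.coord(1, 2));
    }
}
```

With both factors neutralized, any new normalization would have to be applied elsewhere, for example in the scorers and IndexSearcher.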
I don't think that the changes on the scorers are so big. You just add a
new method for computing your netCoord, as far as I understand. So even if
your new scoring/normalization does not find its way into Lucene, maybe
the changes on the scorers could.
Christoph