On May 23, 2006, at 11:38 PM, Chris Hostetter wrote:
it has the nice property of giving small increases as the frequency
increases a small amount, then increasing faster once you reach the
point where you think small increases are significant, and then grows
slower again once you are above the point where you think more
occurrences are actually significant.
Gotcha.
"Normalization" and "norms" are funny words to use in this context,
since you're aggressively manipulating a score multiplier rather than
normalizing in the usual sense.
: I tried to graph the hyperbolic function (tip for OS X users: check
: out Grapher.app, in Utilities). It looks like by default, everything
: cancels out and it returns a constant 2. But it's pretty complicated,
: so maybe I missed something.
Hmm... maybe I screwed up the defaults at some point ...
Nah, I found my error -- just a typo that happened somewhere while I
was swapping in the default values. I now see something similar to
what you describe, though the plateaus above and below the transition
look completely flat.
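For anyone following along without a grapher handy, here is a minimal
sketch of a curve with that shape. It assumes a tanh-based form, and the
names and values (MIN, MAX, MIDPOINT, WIDTH) are made up for
illustration; they are not the defaults under discussion.

```java
public class SigmoidTfSketch {
    // Hypothetical parameters, chosen for illustration only.
    static final float MIN = 0.0f;       // lower plateau
    static final float MAX = 2.0f;       // upper plateau
    static final float MIDPOINT = 10.0f; // frequency where the curve is steepest
    static final float WIDTH = 4.0f;     // larger values give a gentler transition

    // Maps a raw term frequency onto an S-shaped curve between MIN and MAX.
    static float tf(float freq) {
        double x = (freq - MIDPOINT) / WIDTH;
        return (float) (MIN + (MAX - MIN) / 2.0 * (Math.tanh(x) + 1.0));
    }

    public static void main(String[] args) {
        // Print a small table of freq -> tf to eyeball the shape.
        for (int freq = 0; freq <= 20; freq += 2) {
            System.out.println(freq + "\t" + tf(freq));
        }
    }
}
```

With these values the output hugs MIN for low frequencies, passes
through the midpoint between MIN and MAX at freq == MIDPOINT, and
approaches MAX well above it, so the plateaus look nearly flat when
plotted.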
Alas ... tf() doesn't take in a field name. To do this, you'd have to
override the Similarity each time you construct a query object,
something like this I believe...
Query q = new TermQuery(t) {
  public Similarity getSimilarity(Searcher s) {
    // Wrap whatever Similarity this query would normally use,
    // overriding only tf().
    return new SimilarityDelegator(super.getSimilarity(s)) {
      public float tf(float freq) {
        ...
      }
    };
  }
};
...but good lord if that isn't a pain.
Well, let's toss aside backwards-compatibility concerns for the
purposes of discussion, and see what it would take to make tf()
change per-Field.
Adding a fieldName argument to similarity.tf(freq) would add
significant overhead, since it gets called a *lot*.
To avoid that, my first thought is that you'd need to supply a
different Similarity object for each field, by adding a fieldName
argument to searcher.getSimilarity(). I doubt this would work,
because Lucene's freq/prox files -- unlike its norms -- are
consolidated, with terms from multiple fields in one file. It would
be hard for the Scorer to know what field it was operating on.
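To make the idea concrete, here is a rough sketch of per-field lookup
done once per field rather than once per tf() call. Everything in it
(TfFunction, PerFieldTf, forField) is hypothetical and stands apart from
Lucene's actual classes. It also assumes the caller already knows the
field name, which is exactly the part that is hard for the Scorer.

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldTfSketch {
    /** Hypothetical stand-in for the tf() part of a Similarity. */
    interface TfFunction {
        float tf(float freq);
    }

    /** Hypothetical registry mapping field names to tf functions. */
    static class PerFieldTf {
        private final Map<String, TfFunction> byField =
            new HashMap<String, TfFunction>();
        private final TfFunction fallback;

        PerFieldTf(TfFunction fallback) {
            this.fallback = fallback;
        }

        void register(String fieldName, TfFunction f) {
            byField.put(fieldName, f);
        }

        // Resolve once when scoring starts for a field, so the hot
        // tf() call itself carries no fieldName argument.
        TfFunction forField(String fieldName) {
            TfFunction f = byField.get(fieldName);
            return f != null ? f : fallback;
        }
    }

    static PerFieldTf example() {
        // Fallback: square-root tf.
        PerFieldTf perField = new PerFieldTf(new TfFunction() {
            public float tf(float freq) { return (float) Math.sqrt(freq); }
        });
        // "title" is an assumed field name: presence matters, count doesn't.
        perField.register("title", new TfFunction() {
            public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
        });
        return perField;
    }

    public static void main(String[] args) {
        PerFieldTf perField = example();
        System.out.println(perField.forField("title").tf(9f));
        System.out.println(perField.forField("body").tf(9f));
    }
}
```

The lookup cost is paid per field, not per posting, which is the point
of moving the field name out of tf() itself.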
My gut is telling me that this is another reason to consolidate freq,
prox, and norm/boost into a single stream.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/