Similarity Usage: tf(int) vs tf(float)

Chris Hostetter Thu, 16 Feb 2006 15:59:31 -0800

I've been working on my own custom similarity lately, to take advantage of
some content domain knowledge.


One of the things that never really made sense to me before about the
Similarity class was the existence of the two tf methods...

   public abstract float tf(float freq);
   public float tf(int freq) { return tf((float)freq); }

...but today it finally hit me, someone please correct me if I'm wrong...

* tf(int) is what Scorers should use when looking at the frequencies of
  "stuff" in whole numbers -- ie: in TermQuery where the question is "how
  many times does this term appear in the field?"

* tf(float) is what Scorers should use when looking at the frequencies
  of "stuff" which can exist in partial states, and thus be represented
  as a fraction -- ie: in a sloppy PhraseQuery where the question is "how
  often does something approximating this phrase appear?"

While Scorers of "sloppy" queries should use Similarity.sloppyFreq(int) to
determine the (float)freq value to be used for each instance of a sloppy
match (based on the edit distance of that match), those Scorers should
pass the *sum* of the sloppyFreq for each match to tf(float).
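
To make that concrete, here is a rough standalone sketch of the
accumulation as I understand it. It does not use the Lucene classes: the
class name and phraseFreq helper are made up for illustration, and the
sloppyFreq and tf bodies are assumed to mirror what DefaultSimilarity does
(1/(distance+1) and sqrt(freq) respectively).

```java
// Standalone illustration only; class and helper names are hypothetical.
class SloppyFreqSketch {

    // Assumed to mirror DefaultSimilarity: an exact match (distance 0)
    // contributes 1.0, sloppier matches contribute a smaller fraction.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Assumed to mirror DefaultSimilarity's tf(float).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // What a sloppy scorer would do for one document: sum the
    // sloppyFreq of every match, then hand the sum to tf(float).
    static float phraseFreq(int[] matchDistances) {
        float sum = 0.0f;
        for (int distance : matchDistances) {
            sum += sloppyFreq(distance);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three matches in one doc: one exact, two sloppy.
        int[] distances = {0, 1, 3};
        float freq = phraseFreq(distances); // 1.0 + 0.5 + 0.25 = 1.75
        System.out.println("phraseFreq=" + freq + " tf=" + tf(freq));
    }
}
```

Note that the value passed to tf(float) here is above 1.0 even though only
one of the three matches was exact.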

Which means when writing your own Similarity you have to be careful to
consider what tf(float) returns not just on input between 0.0 and 1.0, but
also above 1.0 (because multiple exact or partial matches could result in
a sum(phraseFreq) > 1.0).

But both tf(int) and tf(float) should return 0.0 when their input is 0,
otherwise non-matching results will get a positive score.
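
As a sanity check on those constraints, here is a minimal sketch of a tf
pair that behaves sensibly on 0, on fractions, and on values above 1.0.
The class name and the log1p curve are just my assumptions for the
example, not anything Lucene ships; a real implementation would override
tf(float) in a Similarity subclass instead.

```java
// Standalone illustration only; not a Lucene class.
class TfSketch {

    // Hypothetical dampened tf: 0 for no matches, increasing for
    // fractional (sloppy) frequencies and for sums above 1.0.
    static float tf(float freq) {
        if (freq <= 0.0f) {
            return 0.0f; // non-matching docs must not get a positive score
        }
        return (float) Math.log1p(freq); // log(1 + freq), which is 0 at freq 0
    }

    // Same delegation as in Similarity: whole-number frequencies just
    // go through the float version.
    static float tf(int freq) {
        return tf((float) freq);
    }

    public static void main(String[] args) {
        float[] samples = {0.0f, 0.25f, 1.0f, 1.75f};
        for (float f : samples) {
            System.out.println("tf(" + f + ") = " + tf(f));
        }
        System.out.println("tf(2) = " + tf(2));
    }
}
```

Something like 1 + log(freq) would fail this check: it is undefined at 0
and goes negative for small fractional frequencies.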


        Have I (over|under|mis)stated anything?


-Hoss

