I've been working on my own custom similarity lately, to take advantage of some content domain knowledge.
One of the things that never really made sense to me before about the Similarity class was the existence of the two tf methods...

   public abstract float tf(float freq);
   public float tf(int freq) { return tf((float)freq); }

...but today it finally hit me. Someone please correct me if I'm wrong...

 * tf(int) is what Scorers should use when looking at the frequencies of "stuff" in whole numbers -- ie: in TermQuery, where the question is "how many times does this term appear in the field?"

 * tf(float) is what Scorers should use when looking at the frequencies of "stuff" which can exist in partial states, and thus be represented as a fraction -- ie: in a sloppy PhraseQuery, where the question is "how often does something approximating this phrase appear?"

Scorers of "sloppy" queries should use Similarity.sloppyFreq(int) to determine the (float)freq value to be used for each instance of a sloppy match (based on the edit distance of that match), and then pass the *sum* of the sloppyFreq values for all matches to tf(float).

Which means when writing your own Similarity you have to be careful to consider what tf(float) returns not just on input between 0.0 and 1.0, but also above 1.0 (because multiple exact or partial matches could result in a sum(phraseFreq) > 1.0).

But both tf(int) and tf(float) should return 0.0 when their input is 0, otherwise non-matching results will get a positive score.

Have I (over|under|mis)stated anything?

-Hoss
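
P.S. To make the above concrete, here is a rough standalone sketch of how I picture it. It deliberately does NOT extend the real Similarity class: TfSketch and sloppyPhraseTf are names I made up for illustration, and the square-root tf and 1/(distance+1) sloppyFreq are only meant to mirror what I understand the default implementation to do, so treat the formulas as assumptions.

```java
// Illustrative sketch only: a standalone class, not a real Lucene Similarity.
// TfSketch and sloppyPhraseTf are hypothetical names used for this example.
final class TfSketch {

    // tf(float): has to behave sensibly for 0, for fractions, and for values
    // above 1.0. Returning 0 for a zero frequency keeps non-matches at score 0.
    static float tf(float freq) {
        return freq > 0.0f ? (float) Math.sqrt(freq) : 0.0f;
    }

    // tf(int): whole-number counts simply delegate to the float version,
    // the same way the method quoted above does.
    static float tf(int freq) {
        return tf((float) freq);
    }

    // Weight of a single sloppy match: 1.0 for an exact match (distance 0),
    // shrinking as the edit distance grows. Assumed formula.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // What I think a sloppy phrase scorer does: sum the per-match weights,
    // then make a single tf(float) call on that sum.
    static float sloppyPhraseTf(int[] matchDistances) {
        float freq = 0.0f;
        for (int distance : matchDistances) {
            freq += sloppyFreq(distance);
        }
        return tf(freq);
    }

    public static void main(String[] args) {
        // TermQuery-style: the term appears 3 times in the field.
        System.out.println(tf(3));
        // Sloppy PhraseQuery-style: matches at distances 0, 1 and 3 give a
        // summed frequency of 1.0 + 0.5 + 0.25 = 1.75, i.e. above 1.0.
        System.out.println(sloppyPhraseTf(new int[] {0, 1, 3}));
        // No matches at all: the summed frequency is 0, so tf must be 0 too.
        System.out.println(sloppyPhraseTf(new int[] {}));
    }
}
```

The point of the middle call is that one exact match plus two partial ones already pushes the input to tf(float) past 1.0, so a custom tf that only thinks about the 0.0 to 1.0 range would be caught out.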