Hello. I am using Lucene to submit fuzzy queries against an index. Relevant matches are usually retrieved, but the scoring is not at all what I expected.
For example, if my query is "rightches~", a reference to a text file containing the single word "righteous" is returned with a score of 100 percent. However, I think the actual score should be somewhere in the neighborhood of .66, not 1. Anyone follow me?

Degree of similarity is what I want in this case, but the Lucene score does not take into account how well a term matches a FuzzyQuery. That just seems to be the way Lucene is built currently. The FuzzyQuery gets rewritten as a BooleanQuery with all matching terms OR'd together, and the score is based on the term frequency of the actual matching term. So "righteous" scores as an exact hit, even though it is three edits away from "rightches".

What I want is to get at the raw difference that Lucene uses to pick the matching terms: the Levenshtein distance. I think I'll need to use the code in FuzzyTermEnum.java (or .cs) as a starting point. I figure I can probably use that code directly somehow, or at least borrow the similarity computation.

Frankly, though, I'm not sure I'm heading down the right path on this. Can anyone help with specifics, past experience, or examples?

Cheers,
Mike
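
P.S. To make the .66 figure concrete, here is a minimal sketch of the similarity I have in mind. It is a standalone helper, not Lucene code: the class name FuzzySimilarity and both method names are made up for illustration, and dividing the edit distance by the length of the shorter string is my assumption about how the score should be normalized.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class FuzzySimilarity {

    // Classic Levenshtein distance using two rolling rows of the DP table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: similarity = 1 - distance / length of the shorter string.
    // The result can drop below 0 when the strings differ greatly in length.
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) distance(a, b) / shorter;
    }

    public static void main(String[] args) {
        // "ches" -> "eous" needs three substitutions, so the distance is 3.
        System.out.println(distance("rightches", "righteous"));
        // 1 - 3/9, i.e. roughly the .66 I expected to see as the score.
        System.out.println(similarity("rightches", "righteous"));
    }
}
```

Something along these lines could be applied to each hit's matching term after the search, but I don't know yet whether that is the sensible place to hook it in.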