Hi Uwe,

Thanks for all the pointers! I tried using BooleanSimilarity and the resulting scores were even more divergent: 1.0 for the exact match vs 1.55 (= 0.8 + 0.75) for the document with multiple close terms. That makes sense given that BooleanSimilarity ignores term frequency (TF), but it still doesn't help me down-boost the other terms.

On 2022/07/09 16:23:37 Uwe Schindler wrote:
> Hi
> > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > matches, or even to incorporate the edit distance more generally into
> > the per-term score, although it does seem like that would be something
> > people would generally expect.
>
> Actually it does this:
>
> * By default FuzzyQuery uses a rewrite method that expands all terms
>   as should clauses into a boolean query:
>   MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
> * TopTermsRewrite basically keeps track of a "boost" factor for each
>   term and sorts the "best" terms in a PQ:
>
>   https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
> * For each collected term the term enumeration sets a boost (1.0 for
>   exact match):
>
>   https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
>

Thanks for the link to this calculation. I spent a long time trying to find it but kept missing it. There are some interesting things happening here: longer terms come out as more similar. Starting from "spark", "spar" is 75% similar because it's a 4-character term that needs a single edit (1 - 1/4 = 0.75), and "spare" is 80% similar because it's a 5-character term with a single edit (1 - 1/5 = 0.8). I don't have enough information yet to say whether this is expected in the application, but it explains how we get the scores, so there's something satisfying about at least that bit.

As a hacky idea, I tried changing the boost in FuzzyTermsEnum from the computed similarity to its square, which worked for this exact case but didn't hold up once I added a third fuzzy term to the competing document.

After thinking about this more, I suspect that what I really want is for FuzzyQuery to score as the max of any of the matching terms, rather than the sum. This would be a big change though.
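
To make the arithmetic concrete, here is a minimal standalone sketch in plain Java with no Lucene dependency. It assumes the boost is 1 - edits / min(queryLength, termLength), which is my reading of the FuzzyTermsEnum lines linked above, and it assumes BooleanSimilarity-style scoring where each matching clause contributes only its boost. The class and method names are hypothetical, and the edit counts are passed in by hand rather than computed.

```java
import java.util.List;

// Hypothetical model of the scores discussed in this thread; not Lucene code.
class FuzzyScoreSketch {

    // Assumed boost formula: 1 - edits / min(queryLength, termLength),
    // with lengths counted in Unicode code points.
    static double similarity(String query, String term, int edits) {
        int queryLength = query.codePointCount(0, query.length());
        int termLength = term.codePointCount(0, term.length());
        int minLength = Math.min(queryLength, termLength);
        return 1.0 - (double) edits / minLength;
    }

    // Sum of the per-term boosts, as a boolean query of SHOULD clauses would combine them.
    static double sumScore(List<Double> boosts) {
        double total = 0.0;
        for (double boost : boosts) {
            total += boost;
        }
        return total;
    }

    // Alternative: only the best-matching term counts.
    static double maxScore(List<Double> boosts) {
        double best = 0.0;
        for (double boost : boosts) {
            best = Math.max(best, boost);
        }
        return best;
    }

    public static void main(String[] args) {
        // Document A contains the exact term; document B contains two near misses.
        List<Double> exactDoc = List.of(similarity("spark", "spark", 0));
        List<Double> fuzzyDoc = List.of(
                similarity("spark", "spar", 1),
                similarity("spark", "spare", 1));

        System.out.println("sum: exact=" + sumScore(exactDoc) + " fuzzy=" + sumScore(fuzzyDoc));
        System.out.println("max: exact=" + maxScore(exactDoc) + " fuzzy=" + maxScore(fuzzyDoc));
    }
}
```

Under the sum, the document with two near misses outranks the exact match; under the max, the exact match wins regardless of how many fuzzy terms the other document contains. Lucene's DisjunctionMaxQuery combines sub-query scores this way (plus an optional tie-breaker), though whether FuzzyQuery's rewrite could reasonably produce one is the open question.
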
I don't know that it's fair for multiple approximate matches to outweigh a single exact match here. We get so close to what I need with TestFuzzyQuery.testSingleQueryExactMatchScoresHighest, but it doesn't quite make it all the way. What do you think?

> So in short the exact term gets a boost factor of 1 in the resulting
> term query, all other terms a lower one.
>
> Uwe
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail:u...@thetaphi.de