On Mon, Jul 11, 2022 at 3:36 PM Mike Drob <md...@apache.org> wrote:

> Hi Uwe, thanks for all the pointers!
>
> I tried using BooleanSimilarity and the resulting scores were even more
> divergent! 1.0 for the exact match vs 1.55 (= 0.8 + 0.75) for the multiple
> terms that were close. Which makes sense with ignoring TF but still doesn't
> help me down-boost the other terms.
>
> On 2022/07/09 16:23:37 Uwe Schindler wrote:
> > Hi
> > > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > > matches, or even to incorporate the edit distance more generally into
> > > the per-term score, although it does seem like that would be something
> > > people would generally expect.
> >
> > Actually it does this:
> >
> >   * By default FuzzyQuery uses a rewrite method that expands all terms
> >     as should clauses into a boolean query:
> >     MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
> >   * TopTermsRewrite basically keeps track of a "boost" factor for each
> >     term and sorts the "best" terms in a PQ:
> >     https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
> >   * For each collected term the term enumeration sets a boost (1.0 for
> >     exact match):
> >     https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
>
> Thanks for the link to this calculation. I spent a long time trying to
> find it but kept missing.
>
> There's some interesting things happening here by making longer terms more
> similar. Starting from "spark" we say that "spar" is 75% similar because
> it's a 4 character term that needs a single edit (1/4) and "spare" is 80%
> similar because it's a 5 character term with a single edit (1/5).
> I don't have enough information yet to say if this is expected in the
> application or not, but it explains how we get the scores so there's
> something satisfying about at least that bit.
>
> As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that
> computed similarity to it squared, which worked for this exact case but
> didn't keep up with adding a third fuzzy term to that competing document.
>
> After thinking about this more, I suspect that what I really want is for
> FuzzyQuery to score as the max of any of the matching terms, rather than
> the sum? This would be a big change though. I don't know that it's fair for
> multiple approximate matches to outweigh a single exact match here. We get
> so close to what I need with
> TestFuzzyQuery.testSingleQueryExactMatchScoresHighest but it doesn't quite
> make it all the way.

It looks like if I remove the hard coded use of Boolean RewriteMethod and
let it fall back to the default Disjunction Max I get the behaviour that I
want.
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MultiTermQuery.java#L184
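To make the sum-versus-max difference concrete, here is a minimal sketch in
plain Java. It does not use any Lucene classes: the class FuzzyScoreSketch
and its methods are hypothetical names invented for illustration, and the
boost formula (1 - edits / length of the shorter term) is an assumption
inferred from the 75% and 80% figures quoted above, not a copy of
FuzzyTermsEnum. It also assumes each matching term contributes exactly its
boost as its score, as in the BooleanSimilarity experiment.

```java
// Hypothetical illustration only; none of these names exist in Lucene.
final class FuzzyScoreSketch {

    // Plain Levenshtein distance (insert, delete, substitute), two-row DP.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed boost: 1 - edits / length of the shorter of the two terms.
    static float boost(String queryTerm, String candidate) {
        int minLen = Math.min(queryTerm.length(), candidate.length());
        return 1.0f - (float) editDistance(queryTerm, candidate) / minLen;
    }

    // Boolean-style combination: every matching term adds its score.
    static float sum(float... scores) {
        float total = 0f;
        for (float s : scores) {
            total += s;
        }
        return total;
    }

    // Disjunction-max-style combination: only the best matching term counts.
    static float max(float... scores) {
        float best = 0f;
        for (float s : scores) {
            best = Math.max(best, s);
        }
        return best;
    }

    public static void main(String[] args) {
        float exact = boost("spark", "spark"); // document A contains "spark"
        float spar = boost("spark", "spar");   // document B contains "spar"
        float spare = boost("spark", "spare"); // ... and also "spare"

        System.out.println("A sum=" + sum(exact) + " max=" + max(exact));
        System.out.println("B sum=" + sum(spar, spare) + " max=" + max(spar, spare));
    }
}
```

With these inputs the summed score of the two near misses (0.75 + 0.8)
exceeds the 1.0 of the exact match, while taking the max leaves the exact
match on top.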
What are the use cases where we need a summation of the scores instead of
taking the max?

> What do you think?
>
> > So in short the exact term gets a boost factor of 1 in the resulting
> > term query, all other terms a lower one.
> >
> > Uwe
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail:u...@thetaphi.de