Re: Fuzzy Query Similarity

Mike Drob Mon, 11 Jul 2022 13:36:36 -0700

Hi Uwe, thanks for all the pointers!

I tried using BooleanSimilarity and the resulting scores were even more 
divergent! 1.0 for the exact match vs 1.55 (= 0.8 + 0.75) for the multiple 
terms that were close. That makes sense given that BooleanSimilarity ignores 
TF, but it still doesn't help me down-boost the other terms.

On 2022/07/09 16:23:37 Uwe Schindler wrote:
> Hi
> > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > matches, or even to incorporate the edit distance more generally into
> > the per-term score, although it does seem like that would be something
> > people would generally expect.
> 
> Actually it does this:
> 
>   * By default FuzzyQuery uses a rewrite method that expands all terms
>     as should clauses into a boolean query:
>     MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
>   * TopTermsRewrite basically keeps track of a "boost" factor for each
>     term and sorts the "best" terms in a PQ:
>     
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
>   * For each collected term the term enumeration sets a boost (1.0 for
>     exact match):
>     
> https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
> 

Thanks for the link to this calculation. I spent a long time trying to find it 
but kept missing it.

There are some interesting things happening here, since a single edit costs 
less on longer terms. Starting from "spark" we say that "spar" is 75% similar 
because it's a 4 character term that needs a single edit (1 - 1/4) and "spare" 
is 80% similar because it's a 5 character term with a single edit (1 - 1/5). I 
don't have enough information yet to say if this is expected in the 
application or not, but it explains how we get the scores, so there's 
something satisfying about at least that bit.
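
To make the arithmetic concrete, here is a minimal standalone sketch. It assumes the boost is 1 - edits / min(query length, term length), which matches the two numbers above. The class and method names are made up for illustration, and it uses a plain dynamic-programming Levenshtein distance rather than the automaton-based matching that Lucene itself uses.

```java
// Hypothetical sketch, not Lucene code: reproduces the 0.75 / 0.8 boosts
// under the assumption boost = 1 - edits / min(queryLength, termLength).
class FuzzySimilaritySketch {

    // Plain Levenshtein distance over code points (no transpositions).
    static int editDistance(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j; // turning "" into the first j code points of b
        }
        for (int i = 1; i <= x.length; i++) {
            curr[0] = i; // turning the first i code points of a into ""
            for (int j = 1; j <= y.length; j++) {
                int cost = (x[i - 1] == y[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length];
    }

    // Assumed boost formula: edits are normalized by the shorter term.
    static float similarity(String queryTerm, String indexedTerm) {
        int ed = editDistance(queryTerm, indexedTerm);
        int minLen = Math.min(
            queryTerm.codePointCount(0, queryTerm.length()),
            indexedTerm.codePointCount(0, indexedTerm.length()));
        return 1.0f - (float) ed / (float) minLen;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"spark", "spar", "spare"}) {
            System.out.println(term + " -> " + similarity("spark", term));
        }
    }
}
```

Because the divisor is the shorter of the two lengths, one edit against a 4 character term is penalized more than one edit against a 5 character term.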

As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that 
computed similarity to its square, which worked for this exact case but didn't 
keep up with adding a third fuzzy term to that competing document.

After thinking about this more, I suspect that what I really want is for 
FuzzyQuery to score as the max of any of the matching terms, rather than the 
sum? This would be a big change though. I don't know that it's fair for 
multiple approximate matches to outweigh a single exact match here. We get so 
close to what I need with TestFuzzyQuery.testSingleQueryExactMatchScoresHighest 
but it doesn't quite make it all the way.
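
A rough illustration of the difference, assuming each matching term contributes exactly its boost (as in the BooleanSimilarity experiment above). The class and method names are hypothetical and this is not how FuzzyQuery is implemented today; it only compares the two ways of combining per-term scores.

```java
// Hypothetical sketch: compares sum-of-boosts with max-of-boosts for a
// document with one exact match and a document with two near matches.
class SumVsMaxSketch {

    // Current behaviour as described in this thread: should clauses add up.
    static float sumScore(float[] boosts) {
        float total = 0f;
        for (float b : boosts) {
            total += b;
        }
        return total;
    }

    // Proposed behaviour: only the best matching term counts.
    static float maxScore(float[] boosts) {
        float best = 0f;
        for (float b : boosts) {
            best = Math.max(best, b);
        }
        return best;
    }

    public static void main(String[] args) {
        float[] exactDoc = {1.0f};        // contains "spark"
        float[] fuzzyDoc = {0.8f, 0.75f}; // contains "spare" and "spar"

        System.out.println("sum: exact=" + sumScore(exactDoc)
            + " fuzzy=" + sumScore(fuzzyDoc));
        System.out.println("max: exact=" + maxScore(exactDoc)
            + " fuzzy=" + maxScore(fuzzyDoc));
    }
}
```

With the sum, the two near matches outrank the exact match; with the max, the exact match stays on top no matter how many fuzzy terms the other document contains. Lucene's DisjunctionMaxQuery combines explicit clauses in that max style, but whether the FuzzyQuery rewrite could or should target it is the open question here.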

What do you think?

> So in short the exact term gets a boost factor of 1 in the resulting 
> term query, all other terms a lower one.
> 
> Uwe
> 
> -- 
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail:[email protected]
> 

