Re: Fuzzy Query Similarity

Mike Drob Tue, 12 Jul 2022 13:01:37 -0700

On Mon, Jul 11, 2022 at 3:36 PM Mike Drob <[email protected]> wrote:

> Hi Uwe, thanks for all the pointers!
>
> I tried using BooleanSimilarity and the resulting scores were even more
> divergent! 1.0 for the exact match vs 1.55 (= 0.8 + 0.75) for the multiple
> terms that were close. Which makes sense with ignoring TF but still doesn't
> help me down-boost the other terms.
>
> On 2022/07/09 16:23:37 Uwe Schindler wrote:
> > Hi
> > > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > > matches, or even to incorporate the edit distance more generally into
> > > the per-term score, although it does seem like that would be something
> > > people would generally expect.
> >
> > Actually it does this:
> >
> >   * By default FuzzyQuery uses a rewrite method that expands all terms
> >     as should clauses into a boolean query:
> >     MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
> >   * TopTermsRewrite basically keeps track of a "boost" factor for each
> >     term and sorts the "best" terms in a PQ:
> >
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
> >   * For each collected term the term enumeration sets a boost (1.0 for
> >     exact match):
> >
> https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
> >
>
> Thanks for the link to this calculation. I spent a long time trying to
> find it but kept missing.
>
> There are some interesting things happening here, since longer terms come
> out as more similar. Starting from "spark" we say that "spar" is 75%
> similar because it's a 4-character term that needs a single edit (1 - 1/4)
> and "spare" is 80% similar because it's a 5-character term with a single
> edit (1 - 1/5). I don't have enough information yet to say if this is
> expected in the application or not, but it explains how we get the scores,
> so there's something satisfying about at least that bit.
>
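
As a quick check on those numbers, here is a minimal sketch of the calculation as I read the linked FuzzyTermsEnum lines: one minus the edit distance divided by the shorter of the two term lengths, with exact matches pinned to 1.0. The class and method names are made up for illustration and are not Lucene API, and the edit distance is passed in rather than computed.

```java
public class FuzzyBoostSketch {

  // Hypothetical helper, not Lucene API: boost for a candidate term that is
  // editDistance edits away from the query term.
  static float similarity(String queryTerm, String candidateTerm, int editDistance) {
    if (editDistance == 0) {
      return 1.0f; // exact match keeps the full boost
    }
    // Lengths are counted in code points, not UTF-16 chars.
    int queryLength = queryTerm.codePointCount(0, queryTerm.length());
    int candidateLength = candidateTerm.codePointCount(0, candidateTerm.length());
    int minLength = Math.min(queryLength, candidateLength);
    return 1.0f - (float) editDistance / (float) minLength;
  }

  public static void main(String[] args) {
    System.out.println("spark -> spark: " + similarity("spark", "spark", 0));
    System.out.println("spark -> spar:  " + similarity("spark", "spar", 1));
    System.out.println("spark -> spare: " + similarity("spark", "spare", 1));
  }
}
```

Under that assumption a one-edit match against a longer term always gets a higher boost than a one-edit match against a shorter one, which is the effect described above.
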
> As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that
> computed similarity to it squared, which worked for this exact case but
> didn't keep up with adding a third fuzzy term to that competing document.
>
> After thinking about this more, I suspect that what I really want is for
> FuzzyQuery to score as the max of any of the matching terms, rather than
> the sum? This would be a big change though. I don't know that it's fair for
> multiple approximate matches to outweigh a single exact match here. We get
> so close to what I need with
> TestFuzzyQuery.testSingleQueryExactMatchScoresHighest but it doesn't quite
> make it all the way.
>
It looks like if I remove the hard-coded use of the Boolean RewriteMethod
and let it fall back to the default Disjunction Max, I get the behaviour
that I want.
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MultiTermQuery.java#L184


What are the use cases where we need a summation of the scores instead of
taking the max?
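
To make the difference concrete, here is a small sketch of the two combination rules using the BooleanSimilarity numbers from earlier in the thread (1.0 for the exact match, 0.8 and 0.75 for the two close terms). It is plain arithmetic rather than Lucene code, and the class and method names are hypothetical.

```java
import java.util.Arrays;

public class SumVsMaxSketch {

  // Boolean-query style: every matching term adds its score.
  static double sumScore(double... termScores) {
    return Arrays.stream(termScores).sum();
  }

  // Disjunction-max style (no tie breaker): only the best term counts.
  static double maxScore(double... termScores) {
    return Arrays.stream(termScores).max().orElse(0.0);
  }

  public static void main(String[] args) {
    double[] exactDoc = {1.0};        // document containing the exact term
    double[] fuzzyDoc = {0.8, 0.75};  // document containing two close terms

    System.out.println("sum: exact=" + sumScore(exactDoc) + " fuzzy=" + sumScore(fuzzyDoc));
    System.out.println("max: exact=" + maxScore(exactDoc) + " fuzzy=" + maxScore(fuzzyDoc));
  }
}
```

With summation the document holding two near matches outranks the exact match (1.55 vs 1.0); with max the exact match wins (1.0 vs 0.8).
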


> What do you think?
>
> > So in short the exact term gets a boost factor of 1 in the resulting
> > term query, all other terms a lower one.
> >
> > Uwe
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail:[email protected]
> >
>
