[
https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861960#comment-16861960
]
Mark Harwood commented on LUCENE-8840:
--------------------------------------
{quote}we shouldn't favor documents that contain multiple variations of the
same fuzzy term.
{quote}
For fuzzy I agree that rewarding more variations in a doc is probably
undesirable - a doc will normally pick one spelling for a word and use it
consistently so any variations are more likely to be false positives (your
baz/bad example). Plurals and other suffixed forms would be a notable
exception, but I don't think that's too much of a problem because:
# we can assume that stemming is taking care of normalizing these tokens.
# a lot of fuzzy querying is for things like people's names, which aren't
expressed as plurals or with other common suffixes.
I think all forms of automatic expansion (synonym, fuzzy, wildcard) need a
form of score blending for the expansions they create. Wildcards are perhaps
unlike fuzzy in that finding multiple variations in a doc _is_ desirable - we
_are_ looking for multiple forms, and a document that contains many of them
is better than one that contains few.
> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> ---------------------------------------------------------
>
> Key: LUCENE-8840
> URL: https://issues.apache.org/jira/browse/LUCENE-8840
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Major
> Attachments: LUCENE-8840.patch
>
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite
> method for Fuzzy queries, uses the BlendedTermQuery to score documents that
> match the fuzzy terms. This query blends the frequencies used for scoring
> across the terms and creates a disjunction of all the blended terms. This
> means that each fuzzy term that matches in a document will add its BM25 score
> contribution. We already have a query that can blend the statistics of
> multiple terms in a single scorer that sums the doc frequencies rather than
> the entire BM25 score: the SynonymQuery. Since
> https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles
> boosts between 0 and 1 so it should be easy to change the default rewrite
> method for Fuzzy queries to use it instead of the BlendedTermQuery. This
> would bound the contribution of each term to the final score which seems a
> better alternative in terms of relevancy than the current solution.
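
To make the difference described in the issue concrete, here is a minimal sketch in Python. It is not Lucene's code: the function names, the shared "blended" IDF value, and the simplified BM25 term-frequency formula are all assumptions made for illustration. It compares a disjunction, where every matching fuzzy term adds a full BM25 contribution, with a SynonymQuery-style scorer that sums the boosted per-term frequencies first and scores the total once.

```python
# Illustrative only: hypothetical helpers, not Lucene's BlendedTermQuery or
# SynonymQuery implementation. Document length is assumed equal to the average.

def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Simplified BM25 term-frequency factor; it saturates below 1."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq / (freq + norm)

def disjunction_score(freqs, blended_idf):
    """Each matching term adds its own saturated BM25 contribution."""
    return sum(blended_idf * bm25_tf(f) for f in freqs if f > 0)

def synonym_score(freqs, boosts, blended_idf):
    """Boosted frequencies are summed first, then scored a single time."""
    pseudo_freq = sum(f * w for f, w in zip(freqs, boosts))
    return blended_idf * bm25_tf(pseudo_freq)

IDF = 2.0                 # assumed shared (blended) IDF for all expansions
one_spelling = [2, 0]     # doc uses "bar" twice
two_spellings = [1, 1]    # doc uses "bar" once and "baz" once

# Disjunction: the doc with two variations outscores the consistent one.
print(disjunction_score(two_spellings, IDF) > disjunction_score(one_spelling, IDF))

# Synonym-style, equal boosts: both docs get the same score.
print(synonym_score(two_spellings, [1.0, 1.0], IDF)
      == synonym_score(one_spelling, [1.0, 1.0], IDF))

# Synonym-style, lower boost on the fuzzier term: consistent spelling wins.
print(synonym_score(two_spellings, [1.0, 0.5], IDF)
      < synonym_score(one_spelling, [1.0, 0.5], IDF))
```

Under these assumptions the synonym-style score can never exceed the IDF value, however many expansions match, whereas the disjunction grows with each additional matching variation. That is the "bound the contribution" property the issue description refers to.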