[jira] [Commented] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery

Mark Harwood (JIRA) Wed, 12 Jun 2019 03:04:11 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861960#comment-16861960
 ]


Mark Harwood commented on LUCENE-8840:
--------------------------------------

{quote}we shouldn't favor documents that contain multiple variations of the 
same fuzzy term.
{quote}
 

For fuzzy I agree that rewarding more variations in a doc is probably 
undesirable - a doc will normally pick one spelling for a word and use it 
consistently, so any variations are more likely to be false positives (your 
baz/bad example). Plurals and other suffixed forms would be a notable 
exception, but I don't think that's too much of a problem because:
 # we can assume that stemming is taking care of normalizing these tokens.
 # a lot of fuzzy querying is for things like people's names, which aren't 
expressed as plurals or with other common suffixes.

 

I think all forms of automatic expansion (synonym, fuzzy, wildcard) need a 
form of score blending for the expansions they create. Wildcards are perhaps 
unlike fuzzy in that finding multiple variations in a doc _is_ desirable - we 
_are_ looking for multiple forms, and a document that contains many is better 
than one that contains few.

 

> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> ---------------------------------------------------------
>
>                 Key: LUCENE-8840
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8840
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Major
>         Attachments: LUCENE-8840.patch
>
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite 
> method for Fuzzy queries, uses the BlendedTermQuery to score documents that 
> match the fuzzy terms. This query blends the frequencies used for scoring 
> across the terms and creates a disjunction of all the blended terms. This 
> means that each fuzzy term that matches in a document adds its own BM25 
> score contribution. We already have a query that can blend the statistics of 
> multiple terms in a single scorer that sums the doc frequencies rather than 
> the entire BM25 score: the SynonymQuery. Since 
> https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles 
> boosts between 0 and 1, so it should be easy to change the default rewrite 
> method for Fuzzy queries to use it instead of the BlendedTermQuery. This 
> would bound the contribution of each term to the final score, which seems a 
> better alternative in terms of relevancy than the current solution.
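
To make the difference described above concrete, here is a minimal sketch of the two scoring shapes: summing one BM25 score per matching variant versus summing the (boosted) frequencies and scoring once. It is not Lucene code. The class and method names are hypothetical, length normalization is left out, k1 = 1.2 is simply the usual BM25 default, and idf is passed in as a plain number shared by all variants. Treat it as an illustration of the saturation argument only.

```java
final class BlendingSketch {
    // Usual BM25 default; length normalization is deliberately ignored here.
    static final double K1 = 1.2;

    /** Saturating BM25 term-frequency factor, scaled by idf and a boost. */
    static double bm25(double freq, double idf, double boost) {
        return boost * idf * freq * (K1 + 1) / (freq + K1);
    }

    /** Disjunction-style: every matching variant adds its own BM25 score. */
    static double sumOfScores(int[] freqs, double[] boosts, double idf) {
        double score = 0;
        for (int i = 0; i < freqs.length; i++) {
            score += bm25(freqs[i], idf, boosts[i]);
        }
        return score;
    }

    /** Synonym-style: boosted frequencies are summed first, then scored once. */
    static double scoreOfSummedFreqs(int[] freqs, double[] boosts, double idf) {
        double freq = 0;
        for (int i = 0; i < freqs.length; i++) {
            freq += boosts[i] * freqs[i];
        }
        return bm25(freq, idf, 1.0);
    }

    public static void main(String[] args) {
        // Four fuzzy variants, each occurring three times in one document.
        int[] freqs = {3, 3, 3, 3};
        double[] boosts = {1.0, 1.0, 1.0, 1.0};
        double idf = 1.0;
        System.out.println("sum of scores:      " + sumOfScores(freqs, boosts, idf));
        System.out.println("score of summed tf: " + scoreOfSummedFreqs(freqs, boosts, idf));
        System.out.println("single-term bound:  " + idf * (K1 + 1));
    }
}
```

In this sketch the summed-frequency score can never exceed idf * (k1 + 1), however many variants match, while the sum of per-variant scores grows with the number of matching variants.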




