[ 
https://issues.apache.org/jira/browse/LUCENE-7439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7439:
---------------------------------------
    Attachment: LUCENE-7439.patch

Here's a patch fixing {{FuzzyQuery}} to also accept small terms.

With the simplified {{FuzzyTermsEnum}} the fix itself was simple (remove the 
{{while}} loop). I also fixed the test case to verify that any term within the 
specified edit distance does match.

The one wrinkle is that such matches get a boost of 0.0, because the formula we 
use to compute the boost for a matched term ({{1.0 - editDistance / 
minTermLength}}) can be <= 0 for short terms; e.g. {{abcd}} vs. {{ab}} gives 
1.0 - 2/2 = 0.  I think this is fair: such matches are poor quality compared 
to longer term matches.

> Should FuzzyQuery match short terms too?
> ----------------------------------------
>
>                 Key: LUCENE-7439
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7439
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>
>         Attachments: LUCENE-7439.patch, LUCENE-7439.patch, LUCENE-7439.patch
>
>
> Today, if you ask {{FuzzyQuery}} to match {{abcd}} with edit distance 2, it 
> will fail to match the term {{ab}} even though it's 2 edits away.
> Its javadocs explain this:
> {noformat}
>  * <p>NOTE: terms of length 1 or 2 will sometimes not match because of how the scaled
>  * distance between two terms is computed.  For a term to match, the edit distance between
>  * the terms must be less than the minimum length term (either the input term, or
>  * the candidate term).  For example, FuzzyQuery on term "abcd" with maxEdits=2 will
>  * not match an indexed term "ab", and FuzzyQuery on term "a" with maxEdits=2 will not
>  * match an indexed term "abc".
> {noformat}
> On the one hand, this behavior is somewhat justified: 50% of the characters 
> are different, so this is a very "weak" match. On the other hand, it is quite 
> unexpected: edit distance is an exact measure, so a term within the requested 
> distance should match.
> It seems the behavior is caused by internal implementation details of how 
> the relative (floating point) score is computed.  I think we should fix it, 
> so that edit distance 2 does in fact match all terms with edit distance <= 2.
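
The difference between the documented behavior and the proposed one can be sketched without Lucene at all. The example below is an illustration only: the class and method names are hypothetical, it uses plain Levenshtein distance (no transpositions), and the "old rule" is simply a reading of the javadoc note quoted above, not the actual {{FuzzyTermsEnum}} logic.

```java
public class ShortTermMatchSketch {

    /** Plain Levenshtein distance (insert, delete, substitute) using two rolling rows. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Rule as described in the quoted javadoc: distance must also be below the shorter term's length. */
    public static boolean matchesOldRule(String query, String candidate, int maxEdits) {
        int d = editDistance(query, candidate);
        return d <= maxEdits && d < Math.min(query.length(), candidate.length());
    }

    /** Proposed rule: every term within maxEdits matches. */
    public static boolean matchesProposedRule(String query, String candidate, int maxEdits) {
        return editDistance(query, candidate) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(matchesOldRule("abcd", "ab", 2));      // prints false
        System.out.println(matchesProposedRule("abcd", "ab", 2)); // prints true
        System.out.println(matchesOldRule("a", "abc", 2));        // prints false
        System.out.println(matchesProposedRule("a", "abc", 2));   // prints true
    }
}
```

Both examples from the javadoc note ({{abcd}} vs. {{ab}}, and {{a}} vs. {{abc}}) have an edit distance of exactly 2, so they are rejected only by the extra length condition.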



