[ https://issues.apache.org/jira/browse/LUCENE-7439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-7439:
---------------------------------------
    Attachment: LUCENE-7439.patch

I was struggling to understand the control flow in FuzzyQuery/FuzzyTermsEnum, MultiTermQuery, TopTermsRewrite, etc., so as a first step here I cleaned up deprecated code and tried to simplify FuzzyTermsEnum somewhat.

The attached patch is just this cleanup; it doesn't change the behavior on short terms. All tests pass and I confirmed performance (on Wikipedia) is unchanged.

I plan to first commit this cleanup (master only, removing deprecations), and then separately tackle the short terms.

> Should FuzzyQuery match short terms too?
> ----------------------------------------
>
>                 Key: LUCENE-7439
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7439
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>
>         Attachments: LUCENE-7439.patch
>
>
> Today, if you ask {{FuzzyQuery}} to match {{abcd}} with edit distance 2, it will fail to match the term {{ab}} even though it's 2 edits away.
> Its javadocs explain this:
> {noformat}
>  * <p>NOTE: terms of length 1 or 2 will sometimes not match because of how the scaled
>  * distance between two terms is computed. For a term to match, the edit distance between
>  * the terms must be less than the minimum length term (either the input term, or
>  * the candidate term). For example, FuzzyQuery on term "abcd" with maxEdits=2 will
>  * not match an indexed term "ab", and FuzzyQuery on term "a" with maxEdits=2 will not
>  * match an indexed term "abc".
> {noformat}
> On the one hand, I can see that this behavior is sort of justified in that 50% of the characters are different and so this is a very "weak" match, but on the other hand, it's quite unexpected since edit distance is such an exact measure so the terms should have matched.
> It seems like the behavior is caused by internal implementation details about how the relative (floating point) score is computed. I think we should fix it, so that edit distance 2 does in fact match all terms with edit distance <= 2.
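
To make the rule quoted from the javadocs concrete, here is a minimal standalone sketch in Java. It is not Lucene's implementation: the class and method names (ShortTermFuzzyDemo, editDistance, matchesAsDocumented, matchesAsProposed) are hypothetical, and it assumes plain Levenshtein distance without transpositions. It only contrasts the documented rule (the edit distance must also be less than the length of the shorter term) with the behavior the issue asks for (every term within maxEdits matches).

```java
// Standalone illustration only. This is NOT Lucene code; every name here is hypothetical.
final class ShortTermFuzzyDemo {

    /** Plain Levenshtein distance (insert, delete, substitute). Transpositions are ignored (assumption). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the last completed row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Rule as the quoted javadoc describes it: distance must also be < the shorter term's length. */
    static boolean matchesAsDocumented(String query, String candidate, int maxEdits) {
        int d = editDistance(query, candidate);
        return d <= maxEdits && d < Math.min(query.length(), candidate.length());
    }

    /** Rule the issue proposes: any term with edit distance <= maxEdits matches. */
    static boolean matchesAsProposed(String query, String candidate, int maxEdits) {
        return editDistance(query, candidate) <= maxEdits;
    }

    public static void main(String[] args) {
        String[][] pairs = { {"abcd", "ab"}, {"a", "abc"} }; // the two examples from the javadoc
        for (String[] p : pairs) {
            System.out.println(p[0] + " vs " + p[1]
                + ": distance=" + editDistance(p[0], p[1])
                + ", documented=" + matchesAsDocumented(p[0], p[1], 2)
                + ", proposed=" + matchesAsProposed(p[0], p[1], 2));
        }
    }
}
```

Under these assumptions, both javadoc examples ("abcd" vs "ab", and "a" vs "abc") have edit distance 2, so the documented rule rejects them while the proposed rule accepts them.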