[ https://issues.apache.org/jira/browse/LUCENE-7439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-7439:
---------------------------------------
    Attachment: LUCENE-7439.patch

Here's a patch fixing {{FuzzyQuery}} to also accept small terms.

With the simplified {{FuzzyTermsEnum}} it was quite simple to fix it (remove the {{while}} loop), and to fix the test case to verify any term within the specified edit distance does match.

The one wrinkle is that such matches get a boost of 0.0, because the formula we use to compute the boost for a matched term ({{1.0 - editDistance / minTermLength}}) can be <= 0. I think this is fair: such matches are poor quality compared to longer term matches.

> Should FuzzyQuery match short terms too?
> ----------------------------------------
>
>                 Key: LUCENE-7439
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7439
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>
>         Attachments: LUCENE-7439.patch, LUCENE-7439.patch, LUCENE-7439.patch
>
>
> Today, if you ask {{FuzzyQuery}} to match {{abcd}} with edit distance 2, it
> will fail to match the term {{ab}} even though it's 2 edits away.
> Its javadocs explain this:
> {noformat}
>  * <p>NOTE: terms of length 1 or 2 will sometimes not match because of how the scaled
>  * distance between two terms is computed. For a term to match, the edit distance between
>  * the terms must be less than the minimum length term (either the input term, or
>  * the candidate term). For example, FuzzyQuery on term "abcd" with maxEdits=2 will
>  * not match an indexed term "ab", and FuzzyQuery on term "a" with maxEdits=2 will not
>  * match an indexed term "abc".
> {noformat}
> On the one hand, I can see that this behavior is sort of justified in that
> 50% of the characters are different and so this is a very "weak" match, but
> on the other hand, it's quite unexpected since edit distance is such an exact
> measure so the terms should have matched.
>
> It seems like the behavior is caused by internal implementation details about
> how the relative (floating point) score is computed. I think we should fix
> it, so that edit distance 2 does in fact match all terms with edit distance
> <= 2.
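
For anyone following along, here is a rough, standalone sketch of the two matching rules and the boost formula discussed in this issue. It is not the code from the patch and does not use any Lucene classes: the class name {{FuzzyBoostSketch}} and its methods are made up for illustration, it uses plain Levenshtein distance over Java chars (no transpositions, no code point handling), and clamping the boost at 0.0 with {{Math.max}} is an assumption based only on the statement that such matches get a boost of 0.0.

```java
// Hypothetical illustration only; not Lucene code and not the LUCENE-7439 patch.
final class FuzzyBoostSketch {

    // Plain Levenshtein distance (insert, delete, substitute), two-row DP.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Rule described in the old javadoc note: the edit distance must also be
    // strictly less than the length of the shorter term.
    static boolean matchesOld(String query, String indexed, int maxEdits) {
        int ed = editDistance(query, indexed);
        int minLen = Math.min(query.length(), indexed.length());
        return ed <= maxEdits && ed < minLen;
    }

    // Rule proposed in this issue: any term within maxEdits matches.
    static boolean matchesNew(String query, String indexed, int maxEdits) {
        return editDistance(query, indexed) <= maxEdits;
    }

    // Boost formula quoted above: 1.0 - editDistance / minTermLength.
    // Clamping at 0.0 is an assumption of this sketch.
    static float boost(String query, String indexed) {
        int ed = editDistance(query, indexed);
        int minLen = Math.min(query.length(), indexed.length());
        return Math.max(0.0f, 1.0f - (float) ed / minLen);
    }

    public static void main(String[] args) {
        String[][] pairs = { {"abcd", "ab"}, {"a", "abc"}, {"lucene", "lucine"} };
        for (String[] p : pairs) {
            System.out.println(p[0] + " vs " + p[1]
                + ": ed=" + editDistance(p[0], p[1])
                + " old=" + matchesOld(p[0], p[1], 2)
                + " new=" + matchesNew(p[0], p[1], 2)
                + " boost=" + boost(p[0], p[1]));
        }
    }
}
```

For the javadoc's own example, {{abcd}} vs. {{ab}} with maxEdits=2, the old rule rejects the term (2 is not less than the minimum length 2), while the new rule accepts it with a boost of 0.0, which is the wrinkle mentioned in the comment.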