Unfortunately, this is a limitation of the current FuzzySuggester implementation: it computes edits in UTF-8 byte space instead of Unicode character (code point) space.
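
To make the difference concrete, here is a minimal standalone sketch (plain Java, no Lucene classes) that counts edits over UTF-8 bytes versus over code points. The class and method names are made up for illustration, and it uses plain Levenshtein distance without transpositions, so it only approximates what the suggester does internally:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: compares edit distance measured
// over UTF-8 bytes with edit distance measured over Unicode code points.
class ByteVsCodePointEdits {

    // Plain Levenshtein distance (insert, delete, substitute) over int sequences.
    public static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // The string as a sequence of unsigned UTF-8 byte values.
    public static int[] utf8Units(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    public static int byteEdits(String a, String b) {
        return levenshtein(utf8Units(a), utf8Units(b));
    }

    public static int codePointEdits(String a, String b) {
        return levenshtein(a.codePoints().toArray(), b.codePoints().toArray());
    }

    public static void main(String[] args) {
        // Two Russian words differing in one letter, written with Unicode
        // escapes so the source file encoding does not matter.
        String kot = "\u043A\u043E\u0442";
        String kut = "\u043A\u0443\u0442";
        System.out.println("UTF-8 length: " + utf8Units(kot).length);        // 6
        System.out.println("code points:  " + kot.codePointCount(0, kot.length())); // 3
        System.out.println("byte edits:   " + byteEdits(kot, kut));          // 2
        System.out.println("char edits:   " + codePointEdits(kot, kut));     // 1
        System.out.println("ASCII edits:  " + byteEdits("cat", "cut"));      // 1
    }
}
```

The two middle letters here encode as D0 BE and D1 83, so a single-letter substitution costs two byte edits, while the same change in an ASCII word costs one. The three-letter word is also six units long in byte space, which is roughly why minFuzzyLength behaves differently for the two scripts.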

This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion in Unicode character space, then convert that automaton to UTF-8, then intersect with the suggest FST.

Could you open an issue for this? I won't have time to work on this any time soon, but we should open an issue to discuss / see if someone else has time / iterate.

Thanks!

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 30, 2013 at 8:39 AM, Artem Lukanin <[email protected]> wrote:

> BTW, I have to set maxEdits=2 to allow letter transpositions in Russian,
> because there will be actually 2 transpositions of 4 bytes representing 2
> Russian letters in UTF-8.
>
> The worst case is when one field has both Russian and English letters (or
> e.g. numbers), where I have to use minFuzzyLength=6 and maxEdits=2, which
> will work only for Russian words of more than 2 letters and for English
> words of more than 5 letters!
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-tp4067018p4067026.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
