[jira] [Commented] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

Michael McCandless (JIRA) Wed, 19 Jun 2013 10:28:24 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688185#comment-13688185 ]


Michael McCandless commented on LUCENE-5030:
--------------------------------------------

The easy performance tester to run is
lucene/suggest/src/test/org/apache/lucene/search/suggest/LookupBenchmarkTest.java
... we should test that first I think?  I can also run one based on
FreeDB ... the sources are in luceneutil
(https://code.google.com/a/apache-extras.org/p/luceneutil/ ).

If the perf hit is too much then one option would be to make it
optional (whether we count edits in Unicode space or UTF-8 space), or
maybe just add another suggester class (FuzzyUnicodeSuggester?).

I think we can use INFO_SEP: yes, this is used for PAYLOAD_SEP, but
that only means the incoming surfaceForm cannot contain this char, I
think?  So ... I think we are free to use it in the analyzed form?  Or
did something go wrong when you tried?

Whichever chars we use (steal), we should add checks that these chars do not
occur in the input...
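
A minimal sketch of such a check, in plain Java with no Lucene
dependency. It assumes the reserved separator is U+001F (the ASCII
unit separator, which is what the name INFO_SEP suggests); the class
name, method name and constant below are hypothetical, not existing
suggester API.

```java
final class ReservedCharCheck {
    // Assumption: U+001F stands in for the reserved separator char.
    static final int RESERVED_SEP = 0x001F;

    /** Throws IllegalArgumentException if input contains the reserved code point. */
    static void checkNoReservedChars(CharSequence input) {
        for (int i = 0; i < input.length(); ) {
            int cp = Character.codePointAt(input, i);
            if (cp == RESERVED_SEP) {
                throw new IllegalArgumentException(
                    "input contains reserved separator U+001F at index " + i);
            }
            // Advance by 1 or 2 chars depending on whether cp is supplementary.
            i += Character.charCount(cp);
        }
    }
}
```

Something like this could run on each incoming surface form (and on
the analyzed form, if we steal a char there too) before the FST is
built, so bad input fails fast instead of corrupting the lookup.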

                
> FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
> correctly for 1-byte (like English) and multi-byte (non-Latin) letters
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5030
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.3
>            Reporter: Artem Lukanin
>         Attachments: nonlatin_fuzzySuggester1.patch, 
> nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
> nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester.patch, 
> nonlatin_fuzzySuggester.patch
>
>
> There is a limitation in the current FuzzySuggester implementation: it 
> computes edits in UTF-8 space instead of Unicode character (code point) 
> space. 
> This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
> Unicode character space, then fix FuzzySuggester to do the same steps that 
> FuzzyQuery does: do the LevN expansion in Unicode character space, then 
> convert that automaton to UTF-8, then intersect with the suggest FST.
> See the discussion here: 
> http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none
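
To illustrate the limitation described above, here is a small
standalone sketch (plain Java, no Lucene classes) that computes a
textbook Levenshtein distance once over UTF-8 bytes and once over
Unicode code points. The class and method names are made up for this
example, and it is not how FuzzySuggester computes edits internally
(that is done with automata); it only shows why a single-letter edit
can cost more in UTF-8 space for non-Latin text.

```java
import java.nio.charset.StandardCharsets;

final class EditSpaceDemo {
    /** Classic dynamic-programming Levenshtein distance over int sequences. */
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length];
    }

    /** The string as a sequence of unsigned UTF-8 byte values. */
    static int[] utf8Units(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) out[i] = bytes[i] & 0xFF;
        return out;
    }

    /** The string as a sequence of Unicode code points. */
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String en1 = "cat", en2 = "cut";
        // Cyrillic "mir" and "mur": the middle letters are U+0438 and U+0443,
        // whose UTF-8 encodings (D0 B8 and D1 83) differ in both bytes.
        String ru1 = "\u043C\u0438\u0440", ru2 = "\u043C\u0443\u0440";

        System.out.println(levenshtein(codePoints(en1), codePoints(en2))); // 1
        System.out.println(levenshtein(utf8Units(en1), utf8Units(en2)));   // 1
        System.out.println(levenshtein(codePoints(ru1), codePoints(ru2))); // 1
        System.out.println(levenshtein(utf8Units(ru1), utf8Units(ru2)));   // 2
    }
}
```

With a budget of one edit, the English pair matches in either space,
while the Russian pair matches only when edits are counted in code
point space.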

