[ 
https://issues.apache.org/jira/browse/LUCENE-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860295#action_12860295
 ] 

Robert Muir commented on LUCENE-2265:
-------------------------------------

So here are the advantages of the current patch:
* Full Unicode support (Regular Expression, Wildcard, Fuzzy). For example, 
wildcard ? means a code point, not a code unit.
* Support for matching all Unicode encoding forms easily (utf8, utf16, utf32). 
* Easy to support both the native utf8 terms sort order and the utf8-in-utf16 
order like we have now. This is not feasible with the existing utf16 
representation.
* Easy to safely do DFA operations on Automaton, because there are no 
surrogates anymore. For example, we can safely reverse any automaton to take 
advantage of Solr's leading wildcard support (e.g. to support "leading" 
regexps, too).
* Better compatibility with Lucene, because the automaton is in sync with the 
terms format (byte). This could lead to future optimizations like TermsEnum 
exposing the 'shared prefix' of a term with the previously enumerated term.
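
To illustrate the code point vs. code unit distinction from the first bullet, here is a minimal sketch in plain Java. It does not use any Lucene or automaton classes; the class name and the helper matchesSingleWildcard are made up for illustration only.

```java
public class CodePointWildcard {
    // Number of UTF-16 code units (what a char-based matcher steps over).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (what a wildcard ? should consume).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: true if a lone ? wildcard, defined as
    // "exactly one code point", would match the whole string.
    static boolean matchesSingleWildcard(String s) {
        return !s.isEmpty() && codePoints(s) == 1;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) needs a surrogate pair in UTF-16.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef));             // 2
        System.out.println(codePoints(clef));            // 1
        System.out.println(matchesSingleWildcard(clef)); // true
    }
}
```

A matcher that treats ? as one code unit would reject this one-character term, which is the problem the bullet describes.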

Unfortunately, there are currently a few disadvantages with the patch, but I 
think we can resolve these:
* The linear fuzzy terms enum, from the old code, needs to be fixed to be 
consistent and use utf32 calculations, too.
* For huge DFAs (e.g. fuzzy) there is some cost to the conversion: around 5ms 
one-time cost on my machine for very long strings. Perhaps we can optimize some 
code here; it's not blowing up, though.
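
As a rough sketch of what "utf32 calculations" could mean for the linear fuzzy case, the following computes a plain Levenshtein distance over code points instead of chars. This is not the code from the patch or from the existing fuzzy terms enum; the class and method names are assumptions for illustration, and it needs Java 8 for String.codePoints().

```java
public class CodePointEditDistance {
    // Levenshtein distance where one edit is one code point,
    // not one UTF-16 code unit.
    static int distance(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];
        // Distance from the empty prefix of x to each prefix of y.
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= x.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= y.length; j++) {
                int cost = (x[i - 1] == y[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length];
    }
}
```

With this definition, replacing U+1D11E by U+20000 counts as a single substitution, while a char-based calculation would see two differing code units.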

So in my opinion, the first issue should be resolved before committing, and the 
second is nice-to-have and shouldn't block the improvement.


> improve automaton performance by running on byte[]
> --------------------------------------------------
>
>                 Key: LUCENE-2265
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2265
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, 
> LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, 
> LUCENE-2265.patch, LUCENE-2265_pare.patch, LUCENE-2265_utf32.patch
>
>
> Currently, when enumerating terms, automaton must convert entire terms from 
> flex's native utf-8 byte[] to char[] first, then step each char through the 
> state machine.
> We can make this more efficient by allowing the state machine to run on 
> byte[], so it can return true/false faster.
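
The idea in the description can be sketched with a toy table-driven DFA that steps directly over utf-8 bytes, with no byte[] to char[] conversion. This is only an illustration under stated assumptions: the class ByteDfa and its methods are hypothetical, not the automaton classes touched by the patch, and the DFA here only accepts terms that start with a fixed prefix.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteDfa {
    private static final int DEAD = -1;
    private final int[][] transitions; // [state][unsigned byte] -> next state
    private final boolean[] accept;

    private ByteDfa(int[][] transitions, boolean[] accept) {
        this.transitions = transitions;
        this.accept = accept;
    }

    // Builds a DFA accepting every byte sequence that starts with
    // the utf-8 encoding of the given prefix.
    static ByteDfa forPrefix(String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        int[][] t = new int[p.length + 1][256];
        boolean[] acc = new boolean[p.length + 1];
        for (int[] row : t) {
            Arrays.fill(row, DEAD);
        }
        for (int i = 0; i < p.length; i++) {
            t[i][p[i] & 0xFF] = i + 1; // follow the next prefix byte
        }
        Arrays.fill(t[p.length], p.length); // after the prefix, loop on any byte
        acc[p.length] = true;
        return new ByteDfa(t, acc);
    }

    // Steps the state machine over the term bytes; stops early on a dead state.
    boolean run(byte[] term, int offset, int length) {
        int state = 0;
        for (int i = offset; i < offset + length; i++) {
            state = transitions[state][term[i] & 0xFF];
            if (state == DEAD) {
                return false;
            }
        }
        return accept[state];
    }
}
```

For example, ByteDfa.forPrefix("caf") accepts the utf-8 bytes of "café" and rejects those of "cab", and it can reject as soon as the first mismatching byte is seen.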


