[ https://issues.apache.org/jira/browse/LUCENE-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860295#action_12860295 ]
Robert Muir commented on LUCENE-2265:
-------------------------------------

So here are the advantages of the current patch:
* full unicode support (Regular Expression, Wildcard, Fuzzy). For example, wildcard ? means code point, not code unit.
* support for matching all unicode forms easily (utf8, utf16, utf32).
* easy to support both the native utf8 term sort order and utf8-in-utf16 like we have now. This is not feasible with the existing utf16 representation.
* easy to safely do dfa operations on Automaton, because there are no surrogates anymore. For example, we can safely reverse any automaton to take advantage of Solr's leading wildcard support (e.g. support "leading" regexps, too).
* better compatibility with lucene, because the automaton is in sync with the terms format (byte). This could lead to future optimizations like TermsEnum exposing the 'shared prefix' of a term with the previously enumerated term.

Unfortunately, there are currently a few disadvantages with the patch, but I think we can resolve these:
* The linear fuzzy terms enum, from the old code, needs to be fixed to be consistent and use utf32 calculations, too.
* for huge dfas (e.g. fuzzy) there is some cost to the conversion, around 5ms one-time cost on my machine for very long strings. Perhaps we can optimize some code here; it's not blowing up, though.

So in my opinion, the first item should be resolved before committing, and the second is nice-to-have and shouldn't block the improvement.
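To illustrate the code point vs. code unit distinction and the two sort orders mentioned above, here is a small standalone Java sketch. It is not taken from the patch; the class name `CodePointDemo` and its helpers are made up for illustration, and the two sample characters (U+10400 and U+FB01) are just convenient examples.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene or of the LUCENE-2265 patch.
final class CodePointDemo {

    // Encode a string as UTF-8 bytes.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Compare two byte arrays as unsigned bytes, i.e. plain UTF-8 byte order.
    static int compareUtf8(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
        String supplementary = new String(Character.toChars(0x10400));
        // U+FB01 is a BMP character whose code unit is above the surrogate range.
        String bmp = new String(Character.toChars(0xFB01));

        // Two UTF-16 code units, but a single code point.
        System.out.println("code units:  " + supplementary.length());
        System.out.println("code points: "
                + supplementary.codePointCount(0, supplementary.length()));

        // UTF-16 code unit order puts the supplementary character first
        // (high surrogate 0xD801 is less than 0xFB01) ...
        System.out.println("utf16 order: " + Integer.signum(supplementary.compareTo(bmp)));
        // ... while UTF-8 byte order puts it last (lead byte 0xF0 vs 0xEF).
        System.out.println("utf8 order:  "
                + Integer.signum(compareUtf8(utf8(supplementary), utf8(bmp))));
    }
}
```

A wildcard `?` that consumes one code unit would match only half of the supplementary character, and a term dictionary sorted by UTF-8 bytes visits these two terms in the opposite order from one sorted by UTF-16 code units.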
> improve automaton performance by running on byte[]
> --------------------------------------------------
>
>                 Key: LUCENE-2265
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2265
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch,
> LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch, LUCENE-2265.patch,
> LUCENE-2265.patch, LUCENE-2265_pare.patch, LUCENE-2265_utf32.patch
>
>
> Currently, when enumerating terms, automaton must convert entire terms from
> flex's native utf-8 byte[] to char[] first, then step each char thru the
> state machine.
> we can make this more efficient, by allowing the state machine to run on
> byte[], so it can return true/false faster.
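As a rough illustration of stepping a state machine directly over UTF-8 bytes, the sketch below hand-builds a tiny DFA for the wildcard pattern `a?`, where `?` consumes exactly one code point. It is not the patch's code: `ByteDfa` and `literalThenAnyCodePoint` are hypothetical names, and the UTF-8 structure check is simplified (it does not reject overlong or surrogate encodings).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical byte-level DFA, for illustration only.
final class ByteDfa {
    private static final int DEAD = -1;

    private final int[][] transitions; // [state][unsigned byte] -> next state or DEAD
    private final boolean[] accept;

    private ByteDfa(int[][] transitions, boolean[] accept) {
        this.transitions = transitions;
        this.accept = accept;
    }

    // Builds a DFA for: one literal ASCII byte, then exactly one code point.
    static ByteDfa literalThenAnyCodePoint(char literal) {
        int[][] t = new int[6][256];
        for (int[] row : t) {
            Arrays.fill(row, DEAD);
        }
        t[0][literal & 0x7F] = 1;                          // state 0: expect the literal
        for (int b = 0x00; b <= 0x7F; b++) t[1][b] = 5;    // 1-byte sequence: done
        for (int b = 0xC2; b <= 0xDF; b++) t[1][b] = 2;    // 2-byte lead: 1 continuation left
        for (int b = 0xE0; b <= 0xEF; b++) t[1][b] = 3;    // 3-byte lead: 2 continuations left
        for (int b = 0xF0; b <= 0xF4; b++) t[1][b] = 4;    // 4-byte lead: 3 continuations left
        for (int b = 0x80; b <= 0xBF; b++) {               // continuation bytes count down
            t[2][b] = 5;
            t[3][b] = 2;
            t[4][b] = 3;
        }
        boolean[] accept = new boolean[6];
        accept[5] = true;                                  // state 5: whole pattern matched
        return new ByteDfa(t, accept);
    }

    // Steps the machine over the bytes without decoding them to char[] first.
    boolean run(byte[] bytes, int offset, int length) {
        int state = 0;
        for (int i = offset; i < offset + length; i++) {
            state = transitions[state][bytes[i] & 0xFF];
            if (state == DEAD) {
                return false; // early exit on the first byte that cannot match
            }
        }
        return accept[state];
    }

    boolean matches(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        return run(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        ByteDfa dfa = literalThenAnyCodePoint('a');
        String supplementary = "a" + new String(Character.toChars(0x10400));
        System.out.println(dfa.matches("ab"));          // one ASCII code point after 'a'
        System.out.println(dfa.matches(supplementary)); // one 4-byte code point after 'a'
        System.out.println(dfa.matches("abc"));         // two code points after 'a'
    }
}
```

Because the transitions are indexed by byte, a term stored as UTF-8 can be accepted or rejected in a single pass, and a mismatch stops at the first offending byte.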