[ https://issues.apache.org/jira/browse/LUCENE-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295003#comment-16295003 ]
Robert Muir commented on LUCENE-8102: ------------------------------------- The optimization speeds up leading wildcard type-queries (such as wildcard "*foo" or regex ".*foo") which are far more common e.g. on the users list than the example presented on this issue, so I really think we should keep it. The regex here is quite obscene, maybe its the thing that should be optimized? :) > CompiledAutomaton performance for determining common suffix > ----------------------------------------------------------- > > Key: LUCENE-8102 > URL: https://issues.apache.org/jira/browse/LUCENE-8102 > Project: Lucene - Core > Issue Type: Improvement > Components: core/FSTs > Affects Versions: 7.1 > Reporter: Thomas Poppe > > We're using the automaton package as part of Elasticsearch for doing regexp > queries. Our business requires us to process rather complex regular > expressions, for example (we have more complex examples, but this one > illustrates the problem): > {noformat} > (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d > {noformat} > With a large enough value of maxDeterminizedStates, this works. The problem > we're having is that the conversion of this regular expression to a > CompiledAutomaton takes very long. Almost all of the time goes into > determining the common suffix for the Automaton (which is "d" in this > example) - calculated with a call to Operations.getCommonSuffixBytesRef. > This suffix is only used as an optimization. Skipping the calculation of > this suffix allows us to process these kinds of queries. > - Would it be possible to introduce a way to skip the calculation of this > common suffix (ideally something we control from within our query to > Elasticsearch)? > - Or would it be possible to take a look at this getCommonSuffixBytesRef > operation, to see if it can be optimized? Most of the time goes to > determinizing the reversed automaton - maybe this can be avoided somehow? > Reaction from Mike McCandless on the mailing list: > This is just an optimization; maybe we should expose an option to disable it? > Or maybe we can find the common suffix on an NFA instead, to avoid > determinization? -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org