This is just an optimization; maybe we should expose an option to disable it?
Or maybe we can find the common suffix on an NFA instead, to avoid determinization? Can you open a Jira issue so we can discuss options? Thanks, Mike McCandless http://blog.mikemccandless.com On Fri, Dec 15, 2017 at 5:38 AM, Poppe, Thomas (IP&Science) < thomas.po...@clarivate.com> wrote: > Hello, > > We're using the automaton package as part of Elasticsearch for doing > regexp queries. Our business requires us to process rather complex regular > expressions, for example (we have more complex examples, but this one > illustrates the problem): > > (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d > > With a large enough value of maxDeterminizedStates, this works. The > problem we're having is that the conversion of this regular expression to a > CompiledAutomaton takes very long. Almost all of the time goes into > determining the common suffix for the Automaton (which is "d" in this > example) - calculated with a call to Operations.getCommonSuffixBytesRef. > > If my understanding is correct, this suffix is only used as an > optimization (is this correct?). Skipping the calculation of this suffix > allows us to process these kinds of queries. > > So here are my questions: > - Would it be possible to introduce a way to skip the calculation of this > common suffix (ideally something we control from within our query to > Elasticsearch)? > - Or would it be possible to take a look at this getCommonSuffixBytesRef > operation, to see if it can be optimized? Most of the time goes to > determinizing the reversed automaton - maybe this can be avoided somehow? > - Does anyone have any other suggestions? We've tried to reduce the > complexity of the query, but it is something we would really like to > support. > > Thanks, > Thomas Poppe > >