Thomas Poppe created LUCENE-7921: ------------------------------------ Summary: More efficient way to transform a RegExp to an Automaton Key: LUCENE-7921 URL: https://issues.apache.org/jira/browse/LUCENE-7921 Project: Lucene - Core Issue Type: Improvement Affects Versions: 6.5.1 Reporter: Thomas Poppe Priority: Minor
Consider the following example: public static void main(String[] args) { org.apache.lucene.util.automaton.RegExp regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}"); org.apache.lucene.util.automaton.Automaton automaton = regExp.toAutomaton(); System.out.println("states: " + automaton.getNumStates()); System.out.println("transitions: " + automaton.getNumTransitions()); System.out.println("-------------------------------"); try { regExp = new org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}"); automaton = regExp.toAutomaton(); System.out.println("Will not happen..."); } catch (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) { automaton = regExp.toAutomaton(1_000_000); System.out.println("states: " + automaton.getNumStates()); System.out.println("transitions: " + automaton.getNumTransitions()); System.out.println("-------------------------------"); } } Both regular expressions are equivalent, but it's much more efficient to "unroll" the repetition. It might be possible to optimize the Regex#toAutomaton() method to handle this repetition without going over the default number of determinized states, and using less memory and CPU? -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org