[ 
https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Poppe updated LUCENE-7921:
---------------------------------
    Description: 
Consider the following example:

{code:title=ToAutomatonExample.java|borderStyle=solid}
    public static void main(String[] args) {
        org.apache.lucene.util.automaton.RegExp regExp =
                new 
org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
        org.apache.lucene.util.automaton.Automaton automaton = 
regExp.toAutomaton();
        System.out.println("states: " + automaton.getNumStates());
        System.out.println("transitions: " + automaton.getNumTransitions());
        System.out.println("-------------------------------");

        try {
            regExp = new 
org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
            automaton = regExp.toAutomaton();
            System.out.println("Will not happen...");
        } catch 
(org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
            automaton = regExp.toAutomaton(1_000_000);
            System.out.println("states: " + automaton.getNumStates());
            System.out.println("transitions: " + automaton.getNumTransitions());
            System.out.println("-------------------------------");
        }
    }
{code}

Both regular expressions are equivalent, but it's much more efficient to 
"unroll" the repetition.  It might be possible to optimize the 
Regex#toAutomaton() method to handle this repetition without going over the 
default number of determinized states, and using less memory and CPU?

  was:
Consider the following example:

    public static void main(String[] args) {
        org.apache.lucene.util.automaton.RegExp regExp =
                new 
org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
        org.apache.lucene.util.automaton.Automaton automaton = 
regExp.toAutomaton();
        System.out.println("states: " + automaton.getNumStates());
        System.out.println("transitions: " + automaton.getNumTransitions());
        System.out.println("-------------------------------");

        try {
            regExp = new 
org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
            automaton = regExp.toAutomaton();
            System.out.println("Will not happen...");
        } catch 
(org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
            automaton = regExp.toAutomaton(1_000_000);
            System.out.println("states: " + automaton.getNumStates());
            System.out.println("transitions: " + automaton.getNumTransitions());
            System.out.println("-------------------------------");
        }
    }

Both regular expressions are equivalent, but it's much more efficient to 
"unroll" the repetition.  It might be possible to optimize the 
Regex#toAutomaton() method to handle this repetition without going over the 
default number of determinized states, and using less memory and CPU?


> More efficient way to transform a RegExp to an Automaton
> --------------------------------------------------------
>
>                 Key: LUCENE-7921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7921
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 6.5.1
>            Reporter: Thomas Poppe
>            Priority: Minor
>
> Consider the following example:
> {code:title=ToAutomatonExample.java|borderStyle=solid}
>     public static void main(String[] args) {
>         org.apache.lucene.util.automaton.RegExp regExp =
>                 new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
>         org.apache.lucene.util.automaton.Automaton automaton = 
> regExp.toAutomaton();
>         System.out.println("states: " + automaton.getNumStates());
>         System.out.println("transitions: " + automaton.getNumTransitions());
>         System.out.println("-------------------------------");
>         try {
>             regExp = new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
>             automaton = regExp.toAutomaton();
>             System.out.println("Will not happen...");
>         } catch 
> (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
>             automaton = regExp.toAutomaton(1_000_000);
>             System.out.println("states: " + automaton.getNumStates());
>             System.out.println("transitions: " + 
> automaton.getNumTransitions());
>             System.out.println("-------------------------------");
>         }
>     }
> {code}
> Both regular expressions are equivalent, but it's much more efficient to 
> "unroll" the repetition.  It might be possible to optimize the 
> Regex#toAutomaton() method to handle this repetition without going over the 
> default number of determinized states, and using less memory and CPU?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to