[jira] [Commented] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton

Dawid Weiss (JIRA) Wed, 09 Aug 2017 00:46:18 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119534#comment-16119534
 ]


Dawid Weiss commented on LUCENE-7921:
-------------------------------------

No, what I meant is this:
{code}
states: 118
transitions: 319
-------------------------------
states: 118
transitions: 319
-------------------------------
{code}
As you see the two final minimal representations are identical -- the code that 
converts the automaton to a minimal deterministic automaton should be looked 
into as to why it explodes in the second case; the state count check shouldn't 
explode then, just as in the first example. So I'm not saying you're wrong, but 
it's not about optimizing or rewriting the regexp, it's about fixing the 
determinization routine.

bq. Elasticsearch has no problems executing it (in the unrolled variant).

ES uses Lucene code underneath (correct me if I'm wrong) so if you use the same 
Lucene version you should observe the same result. There were some recent 
commits to this expansion check -- perhaps it's a regression.


> More efficient way to transform a RegExp to an Automaton
> --------------------------------------------------------
>
>                 Key: LUCENE-7921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7921
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 6.5.1
>            Reporter: Thomas Poppe
>            Priority: Minor
>
> Consider the following example:
> {code:title=ToAutomatonExample.java|borderStyle=solid}
>     public static void main(String[] args) {
>         org.apache.lucene.util.automaton.RegExp regExp =
>                 new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]{0,8}");
>         org.apache.lucene.util.automaton.Automaton automaton = 
> regExp.toAutomaton();
>         System.out.println("states: " + automaton.getNumStates());
>         System.out.println("transitions: " + automaton.getNumTransitions());
>         System.out.println("-------------------------------");
>         try {
>             regExp = new 
> org.apache.lucene.util.automaton.RegExp("[a-z]{1,13}x[a-z]{1,13}");
>             automaton = regExp.toAutomaton();
>             System.out.println("Will not happen...");
>         } catch 
> (org.apache.lucene.util.automaton.TooComplexToDeterminizeException e) {
>             automaton = regExp.toAutomaton(1_000_000);
>             System.out.println("states: " + automaton.getNumStates());
>             System.out.println("transitions: " + 
> automaton.getNumTransitions());
>             System.out.println("-------------------------------");
>         }
>     }
> {code}
> Both regular expressions are equivalent, but it's much more efficient to 
> "unroll" the repetition.  It might be possible to optimize the 
> Regex#toAutomaton() method to handle this repetition without going over the 
> default number of determinized states, and using less memory and CPU?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7921) More efficient way to transform a RegExp to an Automaton

Reply via email to