[ 
https://issues.apache.org/jira/browse/LUCENE-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038638#comment-17038638
 ] 

Robert Muir commented on LUCENE-9231:
-------------------------------------

The attached patch adds a check to the build that fails much faster than 
waiting for an OOM.
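
The patch itself is not reproduced in this message, so the following is only a minimal sketch of what a fail-fast heap check could look like, not the actual patch. The class name, method names, and the 2 GB threshold are all hypothetical; the issue only says "some GB". The one real API used is Runtime.getRuntime().maxMemory(), which reports the maximum heap the JVM will attempt to use.

```java
class HeapCheck {
    // Hypothetical threshold: the issue does not state a concrete number.
    static final long REQUIRED_HEAP_BYTES = 2L * 1024 * 1024 * 1024;

    /** Returns true when the available maximum heap meets the requirement. */
    static boolean hasEnoughHeap(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= requiredBytes;
    }

    /** Throws immediately instead of letting a long task run into an OOM. */
    static void requireHeap(long requiredBytes) {
        long max = Runtime.getRuntime().maxMemory();
        if (!hasEnoughHeap(max, requiredBytes)) {
            throw new IllegalStateException(
                "Need at least " + (requiredBytes >> 20) + " MB of heap, but the JVM max is "
                    + (max >> 20) + " MB; raise -Xmx before regenerating.");
        }
    }

    public static void main(String[] args) {
        // Report only, so the demo does not abort on a small default heap.
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("JVM max heap (MB): " + (max >> 20));
        System.out.println("Meets hypothetical threshold: "
            + hasEnoughHeap(max, REQUIRED_HEAP_BYTES));
    }
}
```

A build task would call requireHeap(...) before starting the generator, so an undersized JVM is rejected in milliseconds rather than after minutes of GC pressure.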

> fix algorithmic worst-case in regeneration of URL tokenizer
> -----------------------------------------------------------
>
>                 Key: LUCENE-9231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9231
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-9231_build_check.patch
>
>
> For the UAX29URLEmailTokenizer, the regeneration task is slow. It also 
> requires a very large amount of heap space (I just increased mine after 
> seeing it struggle under GC).
> Maybe we can dig into the worst case and figure out what is happening; it 
> seems to be an automaton issue:
> {noformat}
> "main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable  [0x00007fb1db3df000]
>    java.lang.Thread.State: RUNNABLE
>       at jflex.StateSet.add(StateSet.java:218)
>       at jflex.NFA.closure(NFA.java:387)
>       at jflex.NFA.epsilonFill(NFA.java:410)
>       at jflex.NFA.complement(NFA.java:737)
>       at jflex.NFA.insertNFA(NFA.java:1029)
>       at jflex.NFA.insertNFA(NFA.java:971)
>       at jflex.NFA.insertNFA(NFA.java:1029)
>       at jflex.NFA.insertNFA(NFA.java:972)
>       at jflex.NFA.insertNFA(NFA.java:987)
>       at jflex.NFA.insertNFA(NFA.java:988)
>       at jflex.NFA.insertNFA(NFA.java:987)
>       at jflex.NFA.insertNFA(NFA.java:971)
>       at jflex.NFA.insertNFA(NFA.java:1041)
>       at jflex.NFA.insertNFA(NFA.java:987)
>       at jflex.NFA.insertNFA(NFA.java:971)
>       at jflex.NFA.insertNFA(NFA.java:971)
>       at jflex.NFA.addRegExp(NFA.java:151)
>       at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
>       at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
>       at jflex.LexParse.do_action(LexParse.java:939)
>       at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
>       at jflex.Main.generate(Main.java:73)
>       at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
> {noformat}
> Stacks are typically in {{jflex.StateSet.add(StateSet.java:218)}} and 
> {{jflex.StateSet.complement(StateSet.java:173)}}, among many other 
> operations, but they always come from the {{addRegExp}} .. {{insertNFA}} .. 
> {{complement}} codepath.
> It feels like something has a bad runtime. I wonder if we can fix it (or at 
> least make it better, e.g. check for a minimum heap size of some GB, print a 
> warning about how long it will take, etc.).
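
As general background on why a {{complement}} step over an NFA can be costly: complementing a nondeterministic automaton generally requires determinizing it first, and the subset construction can produce exponentially many states. The sketch below is not JFlex code and is not a diagnosis of this issue; it only demonstrates that textbook blow-up on a hypothetical NFA for "the n-th symbol from the end is 'a'" over the alphabet {a, b}. All names in it are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

class SubsetBlowup {
    // NFA states are 0..n, stored as bits of a long.
    // State 0 loops on both symbols and also moves to state 1 on 'a'.
    // State i (1 <= i < n) moves to state i+1 on either symbol.
    // State n is accepting and has no outgoing transitions.
    static long step(long set, boolean isA, int n) {
        long next = 0;
        if ((set & 1L) != 0) {
            next |= 1L;                 // self-loop on state 0
            if (isA) next |= 2L;        // 0 --a--> 1
        }
        long middle = set & ~1L & ~(1L << n); // states 1..n-1
        next |= middle << 1;            // i --a|b--> i+1
        return next;
    }

    /** Counts the DFA states reachable by subset construction from {0}. */
    static int countDfaStates(int n) {
        Set<Long> seen = new HashSet<>();
        ArrayDeque<Long> work = new ArrayDeque<>();
        seen.add(1L);
        work.add(1L);
        while (!work.isEmpty()) {
            long cur = work.poll();
            for (boolean isA : new boolean[] {true, false}) {
                long nxt = step(cur, isA, n);
                if (seen.add(nxt)) work.add(nxt); // new subset discovered
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println("NFA states: " + (n + 1)
                + ", reachable DFA states: " + countDfaStates(n));
        }
    }
}
```

For this family the NFA grows linearly in n while the number of reachable subsets doubles with each step, which is the kind of growth that would show up as time spent in set operations and as heap pressure.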



