[ https://issues.apache.org/jira/browse/LUCENE-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038638#comment-17038638 ]
Robert Muir commented on LUCENE-9231:
-------------------------------------

Attached patch adds a helpful check to the build that fails much faster than waiting for some OOM.

> fix algorithmic worst-case in regeneration of URL tokenizer
> -----------------------------------------------------------
>
>                 Key: LUCENE-9231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9231
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-9231_build_check.patch
>
>
> For the UAX29URLEmailTokenizer, the regeneration task is slow. It also
> requires a very large amount of heap space (I just increased mine after
> seeing it struggle under GC).
> Maybe we can dig into the worst case and figure out what is happening, it
> seems to be an automaton issue:
> {noformat}
> "main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable [0x00007fb1db3df000]
>    java.lang.Thread.State: RUNNABLE
>       at jflex.StateSet.add(StateSet.java:218)
>       at jflex.NFA.closure(NFA.java:387)
>       at jflex.NFA.epsilonFill(NFA.java:410)
>       at jflex.NFA.complement(NFA.java:737)
>       at jflex.NFA.insertNFA(NFA.java:1029)
>       at jflex.NFA.insertNFA(NFA.java:971)
>       at jflex.NFA.insertNFA(NFA.java:1029)
>       at jflex.NFA.insertNFA(NFA.java:972)
>       at jflex.NFA.insertNFA(NFA.java:987)
>       at jflex.NFA.insertNFA(NFA.java:988)
>       at jflex.NFA.insertNFA(NFA.java:987)
>       at jflex.NFA.insertNFA(NFA.java:971)
>       at jflex.NFA.insertNFA(NFA.java:1041)
>       at jflex.NFA.insertNFA(NFA.java:987)
>       at jflex.NFA.insertNFA(NFA.java:971)
>       at jflex.NFA.insertNFA(NFA.java:971)
>       at jflex.NFA.addRegExp(NFA.java:151)
>       at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
>       at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
>       at jflex.LexParse.do_action(LexParse.java:939)
>       at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
>       at jflex.Main.generate(Main.java:73)
>       at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
> {noformat}
> Stacks seem to be typically in {{jflex.StateSet.add(StateSet.java:218)}} and
> {{jflex.StateSet.complement(StateSet.java:173)}} and many operations, but
> always come from {{addRegExp}} .. {{insertNFA}} .. {{complement}} codepath.
> Feels like something has a bad runtime, I wonder if we can fix it (or at
> least make it better, e.g. check for some GB ram heap minimum, print a
> warning how long it will take, etc)
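
For illustration only, the "check for some GB ram heap minimum" idea could look roughly like the Java sketch below. This is not the content of LUCENE-9231_build_check.patch, which is not shown in this message. The class name `HeapCheck`, its method names, and the 4 GB threshold are all assumptions made for the example; the issue does not say how much heap the regeneration actually needs. The only real API used is `Runtime.maxMemory()`.

```java
/** Hypothetical pre-flight check; NOT the contents of the attached patch. */
final class HeapCheck {

    /** Assumed threshold for illustration only; the issue does not state a number. */
    static final long ASSUMED_MIN_HEAP_BYTES = 4L * 1024 * 1024 * 1024;

    /** Returns true when the maximum heap meets the requirement. */
    static boolean isHeapSufficient(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= requiredBytes;
    }

    /** Fails immediately with a readable message instead of running into an OOM later. */
    static void requireHeap(long maxHeapBytes, long requiredBytes) {
        if (!isHeapSufficient(maxHeapBytes, requiredBytes)) {
            throw new IllegalStateException(String.format(
                "Regeneration needs at least %d MB of heap, but only %d MB is available; "
                    + "raise -Xmx and retry.",
                requiredBytes >> 20, maxHeapBytes >> 20));
        }
    }

    public static void main(String[] args) {
        // Runtime.maxMemory() reports the most heap the JVM will attempt to use (roughly -Xmx).
        requireHeap(Runtime.getRuntime().maxMemory(), ASSUMED_MIN_HEAP_BYTES);
        System.out.println("Heap check passed; starting regeneration.");
    }
}
```

A check like this would have to run in the same JVM that runs JFlex to be meaningful, since a forked process gets its own heap settings.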