[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353637#comment-17353637 ]
Robert Muir commented on LUCENE-9981: ------------------------------------- Before we {{det(reverse())}} this automaton to compute the common suffix for the optimization, note that it has: * 52000 states * 84000 transitions I think we should just not even bother to do this optimization if the automaton is "big" (say more than 1,000 states or more than 1,000 transitions). Something like this is a 1-line change: {noformat} --- a/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java +++ b/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java @@ -237,7 +237,7 @@ public class CompiledAutomaton implements Accountable { binary = new UTF32ToUTF8().convert(automaton); } - if (this.finite) { + if (this.finite || binary.getNumStates() > 1000 || binary.getNumTransitions() > 1000) { commonSuffixRef = null; } else { // NOTE: this is a very costly operation! We should test if it's really warranted in {noformat} This means the test passes in 5 seconds instead of failing with exception after 268 seconds. I think its a good general fix for this specific method. But still, I think there might be more interesting problems here: 5 seconds still sucks, especially on a 0-document index. And you can be sure I don't want to add a unit test taking 5 seconds of cpu time :) > CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with > default maxDeterminizedStates limit > ------------------------------------------------------------------------------------------------------------ > > Key: LUCENE-9981 > URL: https://issues.apache.org/jira/browse/LUCENE-9981 > Project: Lucene - Core > Issue Type: Task > Reporter: Robert Muir > Priority: Major > Attachments: LUCENE-9981_test.patch > > > We have a {{maxDeterminizedStates = 10000}} limit designed to keep > regexp-type queries from blowing up. > But we have an adversary that will run for 268s on my laptop before hitting > exception, first reported here: > https://github.com/opensearch-project/OpenSearch/issues/687 > When I run the test and jstack the threads, this what I see: > {noformat} > "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 > os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x00007fff7006ca80 nid=0x231c8 > runnable [0x00007fff8b7f0000] > java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106) > at > org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769) > at > org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155) > at > org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247) > at > org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104) > at > org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82) > at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138) > at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114) > at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72) > at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62) > at > org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42) > {noformat} > This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be > an "up-front" optimization to make the actual subsequent terms-intensive part > of the query faster. But it makes the whole query run for nearly 5 minutes > before it does anything. > So I definitely think we should improve {{getCommonSuffixBytesRef}} to be > more "best-effort". For example, it can reduce the lower bound to {{1000}} > and catch the exception like such: > {code} > try { > // this is slow, and just an opto anyway, so don't burn cycles on it for > some crazy worst-case. > // if we don't set this common suffix, the query will just run a bit > slower, that's all. > int limit = Math.min(1000, maxDeterminizedStates); > BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit); > ... (setting commonSuffixRef) > } catch (TooComplexTooDeterminizeException notWorthIt) { > commonSuffixRef = null; > } > {code} > Another, maybe simpler option, is to just check that input state/transitions > accounts don't exceed some low limit N. > Basically this opto is geared at stuff like leading wildcard query of "*foo". > By computing that the common suffix is "foo" we can spend less CPU in the > terms dictionary because we can first do a memcmp before having to run data > thru any finite state machine. It's really a microopt and we shouldn't be > spending whole seconds of cpu on it, ever. > But I still don't quite understand how the current limits are giving the > behavior today, maybe there is a bigger issue and I don't want to shove > something under the rug. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org