Jacky040124 commented on PR #15232: URL: https://github.com/apache/lucene/pull/15232#issuecomment-3358884637
@rmuir Thank you so much for your help. I confirmed the hypothesis that the original benchmark patterns were already optimised to DFAs by Lucene's parser, causing Operations.determinize() to be a no-op.

```
Pattern: a+ → Clean DFA
Pattern: .* → Clean DFA
Pattern: [0-9]+ → Clean DFA
Pattern: a(b+|c+)d → Clean DFA
Pattern: (cat|dog|bird|fish|mouse) → Clean DFA
Only pattern: (a+)+ → NFA (0.003 ms/op)
```

I wrote a throwaway test script using the assertCleanDFA method as you suggested, extracted and tested 27 patterns from OpenJDK's test suite (from the SO post), and found 11 NFA patterns that force determinization work:

```
^(a)?a, ((a|b)?b)+, (aaa)?aaa, ^(a(b(c)?)?)?abc, (a+)+, (a*)+, (b+)+, (|f)?+, (y+)*, (foo|foobar)*, (aa+|bb+)+
```

I also noticed something interesting: the two patterns explicitly marked as "Nondeterministic group" in OpenJDK's test suite seem to be optimized to DFAs by Lucene's parser (≈ 10⁻⁶ ms/op), contradicting OpenJDK's "nondeterministic" label:

```java
// Nondeterministic group
(a+b)+
(a|b)+
```

I've updated RegexDeterminizeBenchmark.java with the 11 verified NFA patterns, ran the benchmark, and committed the results. All patterns now show meaningful determinization work (0.001-0.014 ms/op vs ≈ 10⁻⁶ ms/op previously). The benchmark appears to be working correctly now.

Thank you again for your help. Please let me know if any additional changes are needed!
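To illustrate why determinizing an automaton that is already a DFA does essentially no work, here is a minimal, self-contained sketch of subset construction on a toy automaton. It is not Lucene's implementation: `ToyAutomaton` and all of its methods are hypothetical names invented for this example, it has no epsilon transitions, and it ignores accept states because only the state count matters here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy automaton for illustration only; NOT Lucene's Automaton class.
class ToyAutomaton {
    // transitions.get(state).get(label) = set of target states
    private final Map<Integer, Map<Character, Set<Integer>>> transitions = new HashMap<>();

    void addTransition(int from, char label, int to) {
        transitions
            .computeIfAbsent(from, s -> new HashMap<>())
            .computeIfAbsent(label, c -> new TreeSet<>())
            .add(to);
    }

    // Deterministic iff no (state, label) pair has more than one target.
    boolean isDeterministic() {
        for (Map<Character, Set<Integer>> byLabel : transitions.values()) {
            for (Set<Integer> targets : byLabel.values()) {
                if (targets.size() > 1) {
                    return false;
                }
            }
        }
        return true;
    }

    // Subset construction from `start`; returns the number of reachable
    // DFA states, where each DFA state is a set of original states.
    int determinizedStateCount(int start) {
        Set<Set<Integer>> seen = new HashSet<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> initial = new TreeSet<>();
        initial.add(start);
        seen.add(initial);
        work.add(initial);
        while (!work.isEmpty()) {
            Set<Integer> current = work.poll();
            // Union the targets of every member state, grouped by label.
            Map<Character, Set<Integer>> merged = new HashMap<>();
            for (int state : current) {
                Map<Character, Set<Integer>> byLabel = transitions.get(state);
                if (byLabel == null) {
                    continue;
                }
                for (Map.Entry<Character, Set<Integer>> e : byLabel.entrySet()) {
                    merged.computeIfAbsent(e.getKey(), c -> new TreeSet<>())
                          .addAll(e.getValue());
                }
            }
            for (Set<Integer> next : merged.values()) {
                if (seen.add(next)) {
                    work.add(next);
                }
            }
        }
        return seen.size();
    }

    // Already-deterministic automaton for a+ : 0 -a-> 1, 1 -a-> 1.
    static ToyAutomaton aPlus() {
        ToyAutomaton a = new ToyAutomaton();
        a.addTransition(0, 'a', 1);
        a.addTransition(1, 'a', 1);
        return a;
    }

    // Classic NFA for "ends with ab": state 0 has two 'a' transitions.
    static ToyAutomaton endsWithAb() {
        ToyAutomaton a = new ToyAutomaton();
        a.addTransition(0, 'a', 0);
        a.addTransition(0, 'b', 0);
        a.addTransition(0, 'a', 1);
        a.addTransition(1, 'b', 2);
        return a;
    }

    public static void main(String[] args) {
        ToyAutomaton dfa = aPlus();
        ToyAutomaton nfa = endsWithAb();
        System.out.println("a+ deterministic: " + dfa.isDeterministic()
            + ", subset states: " + dfa.determinizedStateCount(0));
        System.out.println("ends-with-ab deterministic: " + nfa.isDeterministic()
            + ", subset states: " + nfa.determinizedStateCount(0));
    }
}
```

For the deterministic input, every subset is a singleton, so the construction just re-walks the existing states. For the nondeterministic input, state 0 has two `a` transitions, so the construction has to build and deduplicate merged subsets. That merging is the work a determinization benchmark is meant to measure.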
