Jacky040124 commented on PR #15232:
URL: https://github.com/apache/lucene/pull/15232#issuecomment-3358884637

   @rmuir Thank you so much for your help.
   
   I confirmed the hypothesis that the original benchmark patterns were already 
optimised to DFAs by Lucene's parser, causing Operations.determinize() to be a 
no-op.
   
   ```
   Pattern: a+                    → Clean DFA
   Pattern: .*                    → Clean DFA
   Pattern: [0-9]+                → Clean DFA 
   Pattern: a(b+|c+)d             → Clean DFA
   Pattern: (cat|dog|bird|fish|mouse) → Clean DFA 
   Only pattern: (a+)+           → NFA (0.003 ms/op)
   ```
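
   To illustrate why determinizing an already-deterministic automaton does 
almost nothing, here is a minimal, self-contained sketch of subset construction 
over a toy NFA. This is not Lucene's implementation: the class 
DeterminizeSketch, its Nfa type, and the two hand-built example automata are 
all hypothetical and exist only for this illustration.

```java
import java.util.*;

// Toy model only: hypothetical names, not Lucene's Automaton/Operations API.
class DeterminizeSketch {

    /** Tiny NFA without epsilon transitions; state 0 is the start state. */
    static final class Nfa {
        final List<Map<Character, Set<Integer>>> transitions = new ArrayList<>();

        int addState() {
            transitions.add(new HashMap<>());
            return transitions.size() - 1;
        }

        void addTransition(int from, char label, int to) {
            transitions.get(from).computeIfAbsent(label, k -> new TreeSet<>()).add(to);
        }

        /** Deterministic when no state has two targets for the same label. */
        boolean isDeterministic() {
            for (Map<Character, Set<Integer>> out : transitions) {
                for (Set<Integer> targets : out.values()) {
                    if (targets.size() > 1) return false;
                }
            }
            return true;
        }
    }

    /** Runs subset construction from state 0 and returns the number of reachable subsets. */
    static int determinizedStateCount(Nfa nfa) {
        Set<Integer> start = new TreeSet<>(Collections.singleton(0));
        Set<Set<Integer>> seen = new HashSet<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        seen.add(start);
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> current = work.poll();
            // Union the targets of every member state, grouped by label.
            Map<Character, Set<Integer>> next = new HashMap<>();
            for (int s : current) {
                for (Map.Entry<Character, Set<Integer>> e : nfa.transitions.get(s).entrySet()) {
                    next.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            for (Set<Integer> target : next.values()) {
                if (seen.add(target)) work.add(target);
            }
        }
        return seen.size();
    }

    /** Hand-built deterministic automaton for a+ : 0 -a-> 1, 1 -a-> 1. */
    static Nfa aPlus() {
        Nfa n = new Nfa();
        int s0 = n.addState(), s1 = n.addState();
        n.addTransition(s0, 'a', s1);
        n.addTransition(s1, 'a', s1);
        return n;
    }

    /** Hand-built NFA for (a|b)*a(a|b): state 0 has two targets on 'a'. */
    static Nfa secondToLastIsA() {
        Nfa n = new Nfa();
        int s0 = n.addState(), s1 = n.addState(), s2 = n.addState();
        n.addTransition(s0, 'a', s0);
        n.addTransition(s0, 'b', s0);
        n.addTransition(s0, 'a', s1);
        n.addTransition(s1, 'a', s2);
        n.addTransition(s1, 'b', s2);
        return n;
    }

    public static void main(String[] args) {
        for (Nfa n : List.of(aPlus(), secondToLastIsA())) {
            System.out.println("deterministic=" + n.isDeterministic()
                + " inputStates=" + n.transitions.size()
                + " subsets=" + determinizedStateCount(n));
        }
    }
}
```

   For the deterministic input every reachable subset is a singleton, so the 
construction just rediscovers the 2 original states. The nondeterministic input 
has 3 states but yields 4 subsets, which is the kind of extra work a 
determinization benchmark needs its patterns to trigger.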
   
   I wrote a throwaway test script using the assertCleanDFA method as 
you suggested, extracted and tested 27 patterns from OpenJDK's test suite (from 
the SO post), and found 11 NFA patterns that force determinization work:
   ```
   ^(a)?a, ((a|b)?b)+, (aaa)?aaa, ^(a(b(c)?)?)?abc,
   (a+)+, (a*)+, (b+)+, (|f)?+, (y+)*,
   (foo|foobar)*, (aa+|bb+)+
   ```
   
   I also noticed something interesting: the two patterns explicitly marked 
as "Nondeterministic group" in OpenJDK's test suite seem to be optimized to 
DFAs by Lucene's parser (≈ 10⁻⁶ ms/op), contradicting OpenJDK's 
"nondeterministic" label.
   
   ```java
   // Nondeterministic group
   (a+b)+
   (a|b)+
   ```
   
   I've updated RegexDeterminizeBenchmark.java with the 11 verified NFA 
patterns, run the benchmark, and committed the results. All patterns now show 
meaningful determinization work (0.001-0.014 ms/op vs ≈ 10⁻⁶ ms/op previously). 
The benchmark appears to be working correctly now.
   
   Thank you again for your help. Please let me know if any further changes 
are needed!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

