Re: [PR] Terminate automaton after matched the whole prefix for PrefixQuery. [lucene]

via GitHub Wed, 06 Mar 2024 03:19:09 -0800


mikemccand commented on code in PR #13072:
URL: https://github.com/apache/lucene/pull/13072#discussion_r1514268562



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnum.java:
##########
@@ -534,6 +555,9 @@ private BytesRef _next() throws IOException {
         // Match!  Recurse:
         copyTerm();
         currentFrame = pushFrame(state);
+        if (matchAllSuffix) {

Review Comment:
   Couldn't you just do `currentFrame.matchAllSuffix = matchAllsuffix`?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -96,6 +101,35 @@ protected RunAutomaton(Automaton a, int alphabetSize) {
     }
   }
 
+  /** Returns true if this state can accept everything(all remaining 
suffixes). */
+  private boolean canMatchAllSuffix(int state) {
+    assert automaton.isAccept(state);
+    int numTransitions = automaton.getNumTransitions(state);
+    // Apply to PrefixQuery, TermRangeQuery.
+    if (numTransitions == 1) {

Review Comment:
   Can we remove this special case?  Just let the `for` loop below handle the 
1-transition case too?
   
   Edit: hmm, I see, it is subtly different: this is checking for max label 
`255` but the loop below is checking `127`, hmmm.  This is a bit messy -- this 
low level of code shouldn't be specializing to different automata that come 
from the  high level queries.  Can we use `alphabetSize-1` as the 
`transition.max` check instead?  But, separately, we need to figure out why 
`Regexp/WildcardQuery` are compiling down to `127` as their max on `.*` suffix 
transitions?  That is not even correct for matching UTF-8 encoded terms.
   
   Perhaps we could also add tests cases for custom `Automata` passed to 
`AutomatonQuery` matching sometimes binary (non-UTF8) terms?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -96,6 +101,35 @@ protected RunAutomaton(Automaton a, int alphabetSize) {
     }
   }
 
+  /** Returns true if this state can accept everything(all remaining 
suffixes). */
+  private boolean canMatchAllSuffix(int state) {
+    assert automaton.isAccept(state);
+    int numTransitions = automaton.getNumTransitions(state);
+    // Apply to PrefixQuery, TermRangeQuery.
+    if (numTransitions == 1) {
+      Transition transition = new Transition();
+      automaton.getTransition(state, 0, transition);
+      if (transition.dest == state && transition.min == 0 && transition.max == 
255) {
+        return true;
+      }
+    }
+
+    // Apply to RegexpQuery, WildcardQuery.
+    for (int i = 0; i < numTransitions; i++) {
+      Transition transition = new Transition();
+      automaton.getTransition(state, i, transition);
+      if (transition.dest == state && transition.min == 0 && transition.max == 
127) {

Review Comment:
   Maybe refactor this to:
   
   ```
     if (transition.min == 0 && transition.max == 127) {
       if (transition.dest == state) {
         return true;
       } else if (automaton.isAccept(transition.dest) {
         // recurse
         return isMatchAllSuffix(transition.dest);
       }
     }
   ```



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -96,6 +101,35 @@ protected RunAutomaton(Automaton a, int alphabetSize) {
     }
   }
 
+  /** Returns true if this state can accept everything(all remaining 
suffixes). */
+  private boolean canMatchAllSuffix(int state) {
+    assert automaton.isAccept(state);
+    int numTransitions = automaton.getNumTransitions(state);
+    // Apply to PrefixQuery, TermRangeQuery.
+    if (numTransitions == 1) {
+      Transition transition = new Transition();
+      automaton.getTransition(state, 0, transition);
+      if (transition.dest == state && transition.min == 0 && transition.max == 
255) {
+        return true;
+      }
+    }
+
+    // Apply to RegexpQuery, WildcardQuery.
+    for (int i = 0; i < numTransitions; i++) {
+      Transition transition = new Transition();

Review Comment:
   Maybe allocate this scratch `Transition` once up front at the start of the 
method?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnum.java:
##########
@@ -390,10 +391,27 @@ private BytesRef _next() throws IOException {
 
     nextTerm:
     while (true) {
+      // Match sub block's entry directly.
+      if (currentFrame.matchAllSuffix) {
+        while (true) {
+          // TODO: Is it necessary to set currentTransition, state etc.

Review Comment:
   I don't think this is needed, since you short-circuit all the logic here for 
the "match all" case, and when this frame is popped it will be reset properly 
(with `currentTransition`, `state`, etc.) on the next recursion.  So maybe 
change this `TODO` into a comment as to why we don't need to set these members?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java:
##########
@@ -67,12 +68,16 @@ protected RunAutomaton(Automaton a, int alphabetSize) {
     points = a.getStartPoints();
     size = Math.max(1, a.getNumStates());
     accept = new FixedBitSet(size);
+    matchAllSuffix = new FixedBitSet(size);
     transitions = new int[size * points.length];
     Arrays.fill(transitions, -1);
     Transition transition = new Transition();
     for (int n = 0; n < size; n++) {
       if (a.isAccept(n)) {
         accept.set(n);
+        if (canMatchAllSuffix(n)) {

Review Comment:
   Maybe rename to `isMatchAllSuffix`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Terminate automaton after matched the whole prefix for PrefixQuery. [lucene]

Reply via email to