[GitHub] [lucene] mikemccand commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

via GitHub Thu, 25 May 2023 08:07:55 -0700


mikemccand commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205600165



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> 
(inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends 
DaciukMihovAutomatonBuilder {

Review Comment:
   `final` too?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> 
(inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends 
DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code 
points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> 
(inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int 
fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);

Review Comment:
   Hmm, I wonder how this is creating a minimal Automaton?  It seems to create 
a new path for every suffix without sharing the common suffixes?
   
   (This is not a problem with this PR but rather a pre-existing issue and 
likely my not understanding this algorithm!).
   
   Actually, I think this is nearly the same algorithm as the FST Builder, just 
applied to automaton (no outputs) instead of FST.
   
   Edit: maybe minimizing the "tail" of the automaton happens in `convert`?
   
   Edit 2: actually, I think we could maybe further optimize this builder to 
directly build `Automaton` instead of first creating its intermediate (and more 
RAM consuming?) automaton representation.  Future work :)



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> 
(inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends 
DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code 
points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> 
(inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int 
fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);
+        fromIndex += Character.charCount(cp);
+      }
+      state.is_final = true;

Review Comment:
   Do we have a unit test for this class that generates random strings in a 
smallish alphabet, uses this builder to create the minimal automaton, and then 
builds an inefficient automaton with the existing union methods, then 
minimizing in the end, then asserting that the two ways for creating the 
minimal automaton (simple yet slow, complex but fast) produce identical 
(isomorphic) automaton?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> 
(inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends 
DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code 
points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);

Review Comment:
   It looks like we were already doing this conversion previously?  So this 
change is not adding more cost in the `CharsRef` case?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> 
(inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends 
DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code 
points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);

Review Comment:
   We could also decode the next Unicode code point directly from the UTF-8 
bytes, instead of converting up front to a `CharsRef`?  Or maybe just convert 
to `int[]` (`UnicodeUtil.UTF8toUTF32`)?
   
   If we did the former (decode directly from `BytesRef`) we could perhaps not 
even make subclasses here and just have a small `if` in each of the 
add/addSuffix methods to pull the "next int" (either a UTF-8 unit or Unicode 
code point) on each loop.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [lucene] mikemccand commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Reply via email to