mikemccand commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205661713


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> 
(inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends 
DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code 
points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);

Review Comment:
   I don't think Lucene has a `BytesRef` (UTF8) equivalent of 
`Character.codePointAt` and `Character.charCount` (byteCount), but it's quite 
trivial to implement ... UTF-8 makes this easy by just looking at the top 
(sign) bit of each byte to see if the character "continues", I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to