Re: [PR] OPENNLP-1862: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7) (opennlp)

via GitHub Wed, 01 Jul 2026 23:49:17 -0700


mawiesne commented on code in PR #1110:
URL: https://github.com/apache/opennlp/pull/1110#discussion_r3510866855



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.BitSet;
+
+/**
+ * Tests the Unicode {@code Extended_Pictographic} property of a code point.
+ *
+ * <p>This is the one extra property the word boundary algorithm needs (rule WB3c), to keep emoji
+ * zero-width-joiner sequences together. The data is loaded once from the {@code emoji-data.txt}
+ * derived resource of the Unicode Character Database and stored in a {@link BitSet}, so membership
+ * is an O(1) bit test.</p>
+ */
+public final class ExtendedPictographic {
+
+  private static final String RESOURCE = "ExtendedPictographic.txt";
+
+  private static volatile BitSet members;
+
+  private ExtendedPictographic() {
+  }
+
+  // Package-visible so a per-pass caller can resolve the set once (see is(BitSet, int)) rather than
+  // once per code point.
+  static BitSet members() {

Review Comment:
   Please add proper Javadoc here and mention transitive exceptions that might 
occur, namely: `IllegalStateException` and `UncheckedIOException`.
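
   A minimal sketch of what the requested Javadoc could look like. The class
   name `LazyMembersSketch` and the stand-in `load()` with two sample code
   points are hypothetical so that the snippet runs on its own. The extra
   `IllegalArgumentException` entry is an assumption based on `parse(...)` in
   the quoted file, which throws it for malformed lines.

```java
import java.util.BitSet;

final class LazyMembersSketch {

  private static volatile BitSet members;

  private LazyMembersSketch() {
  }

  /**
   * Returns the shared {@code Extended_Pictographic} membership set, loading it on first use.
   *
   * <p>Package-visible so a per-pass caller can resolve the set once rather than once per
   * code point. Callers must treat the returned set as read-only.</p>
   *
   * @return The lazily initialised membership set, never {@code null}.
   * @throws IllegalStateException if the data resource is missing.
   * @throws java.io.UncheckedIOException if the data resource cannot be read.
   * @throws IllegalArgumentException if a data line is malformed.
   */
  static BitSet members() {
    BitSet set = members;
    if (set == null) {
      synchronized (LazyMembersSketch.class) {
        set = members;
        if (set == null) {
          set = load();
          members = set;
        }
      }
    }
    return set;
  }

  // Hypothetical stand-in for the PR's load(): two sample code points instead of the resource.
  private static BitSet load() {
    final BitSet set = new BitSet();
    set.set(0x00A9);  // COPYRIGHT SIGN
    set.set(0x1F600); // GRINNING FACE
    return set;
  }
}
```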



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.BitSet;
+
+/**
+ * Tests the Unicode {@code Extended_Pictographic} property of a code point.
+ *
+ * <p>This is the one extra property the word boundary algorithm needs (rule WB3c), to keep emoji
+ * zero-width-joiner sequences together. The data is loaded once from the {@code emoji-data.txt}
+ * derived resource of the Unicode Character Database and stored in a {@link BitSet}, so membership
+ * is an O(1) bit test.</p>

Review Comment:
   Please change "test" to "check" here.



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java:
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+/**
+ * Looks up the Unicode {@link WordBreak Word_Break} property of a code point.
+ *
+ * <p>The data is loaded once from the {@code WordBreakProperty.txt} resource of the Unicode
+ * Character Database (parsed with simple cursor scanning, no regular expression). Lookup is O(1)
+ * for the Basic Multilingual Plane (a direct array index) and O(log n) for supplementary code
+ * points (a binary search over a small sorted range table), so it imposes no per-character
+ * allocation on the word boundary algorithm.</p>
+ */
+public final class WordBreakProperty {
+
+  private static final String RESOURCE = "WordBreakProperty.txt";
+
+  private static final WordBreak[] VALUES = WordBreak.values();
+
+  private static volatile Data data;

Review Comment:
   Please leave a well-founded comment on why this property is declared 
`volatile` here.
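
   A possible wording for that comment, shown on a reduced version of the
   holder. `VolatileHolderSketch`, the `loads` counter, and the trimmed `Data`
   class are hypothetical and only there to make the snippet runnable; the
   locking pattern mirrors `data()` in the quoted file.

```java
final class VolatileHolderSketch {

  static final class Data {
    final byte[] bmp;

    Data(byte[] bmp) {
      this.bmp = bmp;
    }
  }

  // Counts load() calls, for illustration only.
  static int loads;

  // volatile: data() reads this field once without holding the lock (double-checked locking).
  // The volatile write inside the synchronized block happens-before any read that observes it,
  // so a thread that sees a non-null reference also sees the fully initialised tables.
  private static volatile Data data;

  private VolatileHolderSketch() {
  }

  static Data data() {
    Data d = data; // single volatile read on the fast path
    if (d == null) {
      synchronized (VolatileHolderSketch.class) {
        d = data; // re-check under the lock
        if (d == null) {
          d = load();
          data = d; // volatile write publishes the loaded tables
        }
      }
    }
    return d;
  }

  private static Data load() {
    loads++;
    return new Data(new byte[0x10000]);
  }
}
```

   Reading the field into a local first keeps the fast path at one volatile
   read per call, which is also why the package-visible overloads accept an
   already-resolved instance.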



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java:
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+/**
+ * Looks up the Unicode {@link WordBreak Word_Break} property of a code point.
+ *
+ * <p>The data is loaded once from the {@code WordBreakProperty.txt} resource of the Unicode
+ * Character Database (parsed with simple cursor scanning, no regular expression). Lookup is O(1)
+ * for the Basic Multilingual Plane (a direct array index) and O(log n) for supplementary code
+ * points (a binary search over a small sorted range table), so it imposes no per-character
+ * allocation on the word boundary algorithm.</p>
+ */
+public final class WordBreakProperty {
+
+  private static final String RESOURCE = "WordBreakProperty.txt";
+
+  private static final WordBreak[] VALUES = WordBreak.values();
+
+  private static volatile Data data;
+
+  private WordBreakProperty() {
+  }
+
+  // Immutable Word_Break tables: ordinal per BMP code point, plus supplementary ranges sorted by
+  // start for binary search. Package-visible so a caller that looks up many code points in one pass
+  // (WordSegmenter) can resolve this once and reuse it, instead of paying the volatile read behind
+  // data() on every call.
+  static final class Data {
+    final byte[] bmp;
+    final int[] supplementaryStart;
+    final int[] supplementaryEnd;
+    final byte[] supplementaryValue;
+
+    Data(byte[] bmp, int[] start, int[] end, byte[] value) {
+      this.bmp = bmp;
+      this.supplementaryStart = start;
+      this.supplementaryEnd = end;
+      this.supplementaryValue = value;
+    }
+  }
+
+  // Package-visible so a per-pass caller can resolve the table once (see the ordinalOf/of overloads
+  // that take a resolved Data) rather than once per code point.
+  static Data data() {
+    Data d = data;
+    if (d == null) {
+      synchronized (WordBreakProperty.class) {
+        d = data;
+        if (d == null) {
+          d = load();
+          data = d;
+        }
+      }
+    }
+    return d;
+  }
+
+  private static Data load() {
+    final byte[] bmp = new byte[0x10000];
+    final List<int[]> supplementary = new ArrayList<>();
+    try (InputStream in = WordBreakProperty.class.getResourceAsStream(RESOURCE)) {
+      if (in == null) {
+        throw new IllegalStateException("Missing Word_Break data resource: " + RESOURCE);
+      }
+      parse(in, bmp, supplementary);
+    } catch (IOException e) {
+      throw new UncheckedIOException("Unable to read Word_Break data resource " + RESOURCE, e);
+    }
+    supplementary.sort((a, b) -> Integer.compare(a[0], b[0]));
+    final int[] start = new int[supplementary.size()];
+    final int[] end = new int[supplementary.size()];
+    final byte[] value = new byte[supplementary.size()];
+    for (int i = 0; i < supplementary.size(); i++) {
+      final int[] range = supplementary.get(i);
+      start[i] = range[0];
+      end[i] = range[1];
+      value[i] = (byte) range[2];
+    }
+    return new Data(bmp, start, end, value);
+  }
+
+  // Package-visible so the malformed-data handling can be exercised without the bundled resource.
+  static void parse(InputStream in, byte[] bmp, List<int[]> supplementary) throws IOException {
+    try (BufferedReader reader =
+             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
+      String line;
+      while ((line = reader.readLine()) != null) {
+        final int hash = line.indexOf('#');
+        final String content = (hash < 0 ? line : line.substring(0, hash)).strip();
+        if (content.isEmpty()) {
+          continue;
+        }
+        final int semicolon = content.indexOf(';');
+        if (semicolon < 0) {
+          // A present-but-structurally-wrong line (no ';' to split code points from the value) is a
+          // hard error naming the line, mirroring ExtendedPictographic, not an opaque substring throw.
+          throw new IllegalStateException(
+              "Malformed Word_Break data in " + RESOURCE + " (no ';'): " + content);
+        }
+        final String codePoints = content.substring(0, semicolon).strip();
+        final String value = content.substring(semicolon + 1).strip();
+        final byte ordinal = (byte) WordBreak.fromPropertyName(value).ordinal();
+
+        final int dots = codePoints.indexOf("..");
+        final int start;
+        final int end;
+        if (dots < 0) {
+          start = Integer.parseInt(codePoints, 16);
+          end = start;
+        } else {
+          start = Integer.parseInt(codePoints.substring(0, dots), 16);
+          end = Integer.parseInt(codePoints.substring(dots + 2), 16);
+        }
+        assign(start, end, ordinal, bmp, supplementary);
+      }
+    }
+  }
+
+  private static void assign(int start, int end, byte ordinal, byte[] bmp, List<int[]> supplementary) {
+    final int bmpEnd = Math.min(end, 0xFFFF);
+    if (start <= bmpEnd) {
+      Arrays.fill(bmp, start, bmpEnd + 1, ordinal); // bulk fill the BMP portion of the range
+    }
+    if (end > 0xFFFF) {
+      supplementary.add(new int[] {Math.max(start, 0x10000), end, ordinal});
+    }
+  }
+
+  /**
+   * {@return the {@link WordBreak} value of a code point}
+   *
+   * @param codePoint The code point. Values outside {@code [0, U+10FFFF]} return
+   *     {@link WordBreak#OTHER}.
+   */
+  public static WordBreak of(int codePoint) {
+    return of(data(), codePoint);
+  }
+
+  // Package-visible overload for a caller that already resolved Data once for a whole pass (see
+  // ordinalOf(Data, int)), so it is not looked up again per code point.
+  static WordBreak of(Data resolved, int codePoint) {
+    return VALUES[ordinalOf(resolved, codePoint)];
+  }
+
+  /**
+   * {@return the {@link WordBreak#ordinal() ordinal} of a code point's {@link WordBreak} value}
+   * This is the allocation-free form of {@link #of(int)} for hot loops that work with ordinals.
+   *
+   * @param codePoint The code point. Values outside {@code [0, U+10FFFF]} return the ordinal of
+   *     {@link WordBreak#OTHER}.
+   */
+  public static int ordinalOf(int codePoint) {
+    return ordinalOf(data(), codePoint);
+  }
+
+  // Package-visible overload taking an already-resolved Data (see data()), so a caller that looks up
+  // many code points in one pass (WordSegmenter) pays the volatile read behind data() once for the
+  // whole pass rather than once per code point.
+  static int ordinalOf(Data resolved, int codePoint) {

Review Comment:
   Please add proper Javadoc here.
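
   A rough sketch of how the Javadoc might read. The method body is not part
   of the quoted hunk, so the lookup below is an assumption reconstructed from
   the class-level description (direct array index for the BMP, binary search
   over sorted ranges otherwise). `OrdinalLookupSketch` and the `OTHER = 0`
   constant are hypothetical.

```java
final class OrdinalLookupSketch {

  // Hypothetical ordinal of WordBreak.OTHER; the real value comes from the enum.
  static final int OTHER = 0;

  static final class Data {
    final byte[] bmp;
    final int[] supplementaryStart;
    final int[] supplementaryEnd;
    final byte[] supplementaryValue;

    Data(byte[] bmp, int[] start, int[] end, byte[] value) {
      this.bmp = bmp;
      this.supplementaryStart = start;
      this.supplementaryEnd = end;
      this.supplementaryValue = value;
    }
  }

  private OrdinalLookupSketch() {
  }

  /**
   * Returns the {@code Word_Break} ordinal of a code point using already-resolved tables.
   *
   * <p>Intended for callers that look up many code points in one pass: they resolve the
   * tables once and pass them in, so the volatile read happens once per pass.</p>
   *
   * @param resolved  The tables obtained from {@code data()}. Must not be {@code null}.
   * @param codePoint The code point. Values outside {@code [0, U+10FFFF]} yield the ordinal
   *                  of {@code OTHER}.
   * @return The ordinal of the code point's {@code Word_Break} value.
   */
  static int ordinalOf(Data resolved, int codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF) {
      return OTHER;
    }
    if (codePoint <= 0xFFFF) {
      return resolved.bmp[codePoint]; // O(1) direct index
    }
    int lo = 0;
    int hi = resolved.supplementaryStart.length - 1;
    while (lo <= hi) { // O(log n) search over ranges sorted by start
      final int mid = (lo + hi) >>> 1;
      if (codePoint < resolved.supplementaryStart[mid]) {
        hi = mid - 1;
      } else if (codePoint > resolved.supplementaryEnd[mid]) {
        lo = mid + 1;
      } else {
        return resolved.supplementaryValue[mid];
      }
    }
    return OTHER; // not covered by any range
  }
}
```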



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX #29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link #tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH default} maximum token
+   * length.
+   */
+  public WordTokenizer() {
+    this(DEFAULT_MAX_TOKEN_LENGTH);
+  }
+
+  /**
+   * Creates a tokenizer with the given maximum token length.
+   *
+   * @param maxTokenLength The maximum number of characters in a token; longer tokens are chopped
+   *                       into consecutive pieces. Must be at least {@code 1}.
+   * @throws IllegalArgumentException if {@code maxTokenLength} is less than {@code 1}.
+   */
+  public WordTokenizer(int maxTokenLength) {
+    if (maxTokenLength < 1) {
+      throw new IllegalArgumentException("maxTokenLength must be at least 1, got " + maxTokenLength);
+    }
+    this.maxTokenLength = maxTokenLength;
+  }
+
+  /**
+   * Streams the word tokens of {@code text} to {@code handler} in order, allocating nothing.
+   *
+   * @param text    The text to tokenize.
+   * @param handler The receiver of the tokens.
+   */
+  public void tokenize(CharSequence text, TokenHandler handler) {
+    WordSegmenter.forEachSegment(text, (start, end) -> {
+      final WordType type = WordType.of(text, start, end);
+      if (type != null) {
+        emit(text, start, end, type, handler);
+      }
+    });
+  }
+
+  /**
+   * Returns the word tokens of {@code s} as strings, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token strings.
+   */
+  @Override
+  public String[] tokenize(String s) {

Review Comment:
   Rename the parameter `s` to `text` here.
   
   Add a parameter check for `text`. This parameter should not be null. If it 
is empty, we can return early.
   As a consequence, we need to add Javadoc for the resulting 
"IllegalArgumentException" which should be thrown if parameters are invalid.

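   A minimal sketch of the requested shape: renamed parameter, `null` check,
   early return for empty input, and the matching `@throws` tag. The class
   `TokenizeGuardSketch` is hypothetical, and the whitespace split is only a
   stand-in for the UAX #29 segmenter so that the snippet runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizeGuardSketch {

  private static final String[] NO_TOKENS = new String[0];

  private TokenizeGuardSketch() {
  }

  /**
   * Returns the word tokens of {@code text} as strings, in order.
   *
   * @param text The text to tokenize. Must not be {@code null}.
   * @return The token strings; empty if {@code text} is empty.
   * @throws IllegalArgumentException if {@code text} is {@code null}.
   */
  static String[] tokenize(String text) {
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
    if (text.isEmpty()) {
      return NO_TOKENS; // nothing to segment
    }
    final List<String> tokens = new ArrayList<>();
    // Stand-in segmentation (assumption): runs of non-whitespace characters.
    int start = -1;
    for (int i = 0; i <= text.length(); i++) {
      final boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
      if (!boundary && start < 0) {
        start = i;
      } else if (boundary && start >= 0) {
        tokens.add(text.substring(start, i));
        start = -1;
      }
    }
    return tokens.toArray(new String[0]);
  }
}
```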


##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX #29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link #tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH default} maximum token
+   * length.
+   */
+  public WordTokenizer() {
+    this(DEFAULT_MAX_TOKEN_LENGTH);
+  }
+
+  /**
+   * Creates a tokenizer with the given maximum token length.
+   *
+   * @param maxTokenLength The maximum number of characters in a token; longer tokens are chopped
+   *                       into consecutive pieces. Must be at least {@code 1}.
+   * @throws IllegalArgumentException if {@code maxTokenLength} is less than {@code 1}.
+   */
+  public WordTokenizer(int maxTokenLength) {
+    if (maxTokenLength < 1) {
+      throw new IllegalArgumentException("maxTokenLength must be at least 1, got " + maxTokenLength);
+    }
+    this.maxTokenLength = maxTokenLength;
+  }
+
+  /**
+   * Streams the word tokens of {@code text} to {@code handler} in order, allocating nothing.
+   *
+   * @param text    The text to tokenize.
+   * @param handler The receiver of the tokens.
+   */
+  public void tokenize(CharSequence text, TokenHandler handler) {
+    WordSegmenter.forEachSegment(text, (start, end) -> {
+      final WordType type = WordType.of(text, start, end);
+      if (type != null) {
+        emit(text, start, end, type, handler);
+      }
+    });
+  }
+
+  /**
+   * Returns the word tokens of {@code s} as strings, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token strings.
+   */
+  @Override
+  public String[] tokenize(String s) {
+    final List<String> tokens = new ArrayList<>();
+    tokenize(s, (start, end, type) -> tokens.add(s.substring(start, end)));
+    return tokens.toArray(new String[0]);
+  }
+
+  /**
+   * Returns the offset spans of the word tokens of {@code s}, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token spans.
+   */
+  @Override
+  public Span[] tokenizePos(String s) {
+    final List<Span> spans = tokenizeSpans(s);
+    return spans.toArray(new Span[0]);
+  }
+
+  /**
+   * Returns the offset spans of the word tokens in {@code text}, in order.
+   *
+   * @param text The text to tokenize.
+   * @return The word-token spans.
+   */
+  public List<Span> tokenizeSpans(CharSequence text) {

Review Comment:
   Add a parameter check for `text`. This parameter should not be `null`. If it 
is empty, we can return early.
   As a consequence, we need to add Javadoc for the resulting 
"IllegalArgumentException" which should be thrown if parameters are invalid.

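   For the list-returning variant, one option is a small shared guard that
   the public entry points call first. This is only a sketch:
   `SpanGuardSketch`, `requireText`, and the `Range` class (standing in for
   `opennlp.tools.util.Span`) are hypothetical, and the whitespace split
   replaces the real segmenter.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanGuardSketch {

  // Hypothetical stand-in for opennlp.tools.util.Span: start inclusive, end exclusive.
  static final class Range {
    final int start;
    final int end;

    Range(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  private SpanGuardSketch() {
  }

  // Shared guard so every public entry point rejects null the same way.
  private static void requireText(CharSequence text) {
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
  }

  /**
   * Returns the offset spans of the word tokens in {@code text}, in order.
   *
   * @param text The text to tokenize. Must not be {@code null}.
   * @return The word-token spans; empty if {@code text} is empty.
   * @throws IllegalArgumentException if {@code text} is {@code null}.
   */
  static List<Range> tokenizeSpans(CharSequence text) {
    requireText(text);
    final List<Range> spans = new ArrayList<>();
    if (text.length() == 0) {
      return spans; // early return, same mutable list type as the normal path
    }
    // Stand-in segmentation (assumption): runs of non-whitespace characters.
    int start = -1;
    for (int i = 0; i <= text.length(); i++) {
      final boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
      if (!boundary && start < 0) {
        start = i;
      } else if (boundary && start >= 0) {
        spans.add(new Range(start, i));
        start = -1;
      }
    }
    return spans;
  }
}
```

   Returning the same list type on both paths avoids surprising callers who
   modify the result; whether that matters here is a design choice for the PR.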


##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.BitSet;
+
+/**
+ * Tests the Unicode {@code Extended_Pictographic} property of a code point.

Review Comment:
   Please adjust the first sentence to:
   "Checks the Unicode {@code Extended_Pictographic} property of a code point."
   
   as the verb "tests" might confuse developers reading this.
   



##########
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBoundaryConformanceTest.java:
##########
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Runs the official Unicode {@code WordBreakTest.txt} conformance suite against
+ * {@link WordSegmenter}. Each line marks boundaries with U+00F7 (division sign) and non-boundaries
+ * with U+00D7 (multiplication sign) between code points.
+ */
+public class WordBoundaryConformanceTest {
+
+  private static final int BOUNDARY = 0x00F7; // division sign
+
+  @Test
+  void testOfficialUnicodeWordBreakConformance() throws IOException {
+    int total = 0;
+    int passed = 0;
+    final List<String> failures = new ArrayList<>();
+
+    try (InputStream in = Objects.requireNonNull(

Review Comment:
   Can we please extract the IO-specific code (resource loading) into a 
`@BeforeAll`-annotated init method of the test class? This way, the actual test 
is not bloated with this aspect, and the resource is prepared before the test 
runs.
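
   Roughly what I have in mind, as a sketch only. To keep it runnable without
   JUnit on the classpath, the annotation appears as a comment and the loader
   takes an `InputStream`; in the real test class the method would carry
   `@BeforeAll`, take no parameters, and open the test resource itself.
   `ConformanceFixtureSketch` and its member names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ConformanceFixtureSketch {

  // Filled once before any test runs; the test methods only read it.
  static List<String> conformanceLines;

  private ConformanceFixtureSketch() {
  }

  // @BeforeAll in the real test class (assumption: JUnit 5, static, no parameters).
  static void loadConformanceLines(InputStream in) {
    final List<String> lines = new ArrayList<>();
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Drop trailing '#' comments and blank lines, keep the test case itself.
        final int hash = line.indexOf('#');
        final String content = (hash < 0 ? line : line.substring(0, hash)).strip();
        if (!content.isEmpty()) {
          lines.add(content);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException("Unable to read conformance data", e);
    }
    conformanceLines = List.copyOf(lines); // immutable snapshot shared by all tests
  }

  // A test method then works on the prepared lines and contains no IO of its own.
  static int caseCount() {
    return conformanceLines.size();
  }
}
```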



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.BitSet;
+
+/**
+ * Tests the Unicode {@code Extended_Pictographic} property of a code point.
+ *
+ * <p>This is the one extra property the word boundary algorithm needs (rule WB3c), to keep emoji
+ * zero-width-joiner sequences together. The data is loaded once from the {@code emoji-data.txt}
+ * derived resource of the Unicode Character Database and stored in a {@link BitSet}, so membership
+ * is an O(1) bit test.</p>
+ */
+public final class ExtendedPictographic {
+
+  private static final String RESOURCE = "ExtendedPictographic.txt";
+
+  private static volatile BitSet members;
+
+  private ExtendedPictographic() {
+  }
+
+  // Package-visible so a per-pass caller can resolve the set once (see is(BitSet, int)) rather than
+  // once per code point.
+  static BitSet members() {
+    BitSet set = members;
+    if (set == null) {
+      synchronized (ExtendedPictographic.class) {
+        set = members;
+        if (set == null) {
+          set = load();
+          members = set;
+        }
+      }
+    }
+    return set;
+  }
+
+  private static BitSet load() {
+    final BitSet set = new BitSet();
+    try (InputStream in = ExtendedPictographic.class.getResourceAsStream(RESOURCE)) {
+      if (in == null) {
+        throw new IllegalStateException("Missing Extended_Pictographic data resource: " + RESOURCE);
+      }
+      parse(in, set);
+    } catch (IOException e) {
+      throw new UncheckedIOException(
+          "Unable to read Extended_Pictographic data resource " + RESOURCE, e);
+    }
+    return set;
+  }
+
+  // Package-visible so the malformed-data handling can be exercised without the bundled resource.
+  static void parse(InputStream in, BitSet set) throws IOException {
+    try (BufferedReader reader =
+             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
+      String line;
+      while ((line = reader.readLine()) != null) {
+        final int hash = line.indexOf('#');
+        final String content = (hash < 0 ? line : line.substring(0, hash)).strip();
+        if (content.isEmpty()) {
+          continue;
+        }
+        // Only the code-point column is needed; the property value after ';' is implicit (this is a
+        // filtered single-property file), so a line with no ';' is taken whole -- unlike
+        // WordBreakProperty, whose value column is required.
+        final int semicolon = content.indexOf(';');
+        final String codePoints = (semicolon < 0 ? content : content.substring(0, semicolon)).strip();
+        try {
+          final int dots = codePoints.indexOf("..");
+          if (dots < 0) {
+            set.set(Integer.parseInt(codePoints, 16));
+          } else {
+            final int start = Integer.parseInt(codePoints.substring(0, dots), 16);
+            final int end = Integer.parseInt(codePoints.substring(dots + 2), 16);
+            set.set(start, end + 1);
+          }
+        } catch (NumberFormatException e) {
+          // Fail loud naming the bad line, the same way the sibling loaders do.
+          throw new IllegalArgumentException(
+              "Malformed Extended_Pictographic data in " + RESOURCE + ": " + content, e);
+        }
+      }
+    }
+  }
+
+  /**
+   * {@return whether a code point has the {@code Extended_Pictographic} property}
+   *
+   * @param codePoint The code point. Values outside {@code [0, U+10FFFF]} return {@code false}.
+   */
+  public static boolean is(int codePoint) {
+    return is(members(), codePoint);
+  }
+
+  // Package-visible overload taking an already-resolved BitSet, so a caller that tests many code
+  // points in one pass (WordSegmenter, WordType) pays the volatile read behind members() once for the
+  // whole pass rather than once per code point.
+  static boolean is(BitSet resolved, int codePoint) {

Review Comment:
   Please add proper Javadoc here.
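
   For illustration, a minimal sketch of the Javadoc this overload could carry. The class name `PictographicLookupSketch` is hypothetical, and because the method body is not part of the quoted hunk, the negative-value guard shown here is an assumption rather than the PR's code.

```java
import java.util.BitSet;

/** Hypothetical stand-in for the package-visible overload; not the PR's class. */
final class PictographicLookupSketch {

  private PictographicLookupSketch() {
  }

  /**
   * Tests a code point against an already-resolved {@code Extended_Pictographic} set.
   *
   * <p>Intended for callers that test many code points in one pass, so the set is resolved
   * once per pass rather than once per code point.</p>
   *
   * @param resolved  The resolved member set. Must not be {@code null}.
   * @param codePoint The code point. Negative values return {@code false}; values above
   *                  {@code U+10FFFF} return {@code false} as long as no such bit is set.
   * @return {@code true} if {@code codePoint} is a member of {@code resolved}.
   */
  static boolean is(BitSet resolved, int codePoint) {
    // Assumed guard: BitSet.get(int) throws IndexOutOfBoundsException for a negative index.
    return codePoint >= 0 && resolved.get(codePoint);
  }

  public static void main(String[] args) {
    final BitSet set = new BitSet();
    set.set(0x1F600);
    System.out.println(is(set, 0x1F600));
    System.out.println(is(set, -1));
  }
}
```

   Stating the out-of-range behaviour on the parameter keeps it consistent with the contract already documented on the public `is(int)`.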



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.BitSet;
+
+/**
+ * Tests the Unicode {@code Extended_Pictographic} property of a code point.
+ *
+ * <p>This is the one extra property the word boundary algorithm needs (rule WB3c), to keep emoji
+ * zero-width-joiner sequences together. The data is loaded once from the {@code emoji-data.txt}
+ * derived resource of the Unicode Character Database and stored in a {@link BitSet}, so membership
+ * is an O(1) bit test.</p>
+ */
+public final class ExtendedPictographic {
+
+  private static final String RESOURCE = "ExtendedPictographic.txt";
+
+  private static volatile BitSet members;
+
+  private ExtendedPictographic() {
+  }
+
+  // Package-visible so a per-pass caller can resolve the set once (see is(BitSet, int)) rather than
+  // once per code point.
+  static BitSet members() {
+    BitSet set = members;
+    if (set == null) {
+      synchronized (ExtendedPictographic.class) {
+        set = members;
+        if (set == null) {
+          set = load();
+          members = set;
+        }
+      }
+    }
+    return set;
+  }
+
+  private static BitSet load() {
+    final BitSet set = new BitSet();
+    try (InputStream in = ExtendedPictographic.class.getResourceAsStream(RESOURCE)) {
+      if (in == null) {
+        throw new IllegalStateException("Missing Extended_Pictographic data resource: " + RESOURCE);
+      }
+      parse(in, set);
+    } catch (IOException e) {
+      throw new UncheckedIOException(
+          "Unable to read Extended_Pictographic data resource " + RESOURCE, e);
+    }
+    return set;
+  }
+
+  // Package-visible so the malformed-data handling can be exercised without the bundled resource.
+  static void parse(InputStream in, BitSet set) throws IOException {

Review Comment:
   Please add proper Javadoc here and mention exceptions that might occur, 
namely: `IllegalArgumentException` and `IOException`.
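
   A sketch of how that could read, under stated assumptions: the class `PictographicParseSketch` and the `parseText` helper are hypothetical, the parsing logic is copied from the method body quoted earlier in this thread, and the error message drops the resource name to keep the example self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical stand-in mirroring the quoted parse(...), so the Javadoc reads in context. */
final class PictographicParseSketch {

  private PictographicParseSketch() {
  }

  /**
   * Reads code points and code point ranges from {@code in} and sets the matching bits.
   *
   * <p>Text after {@code '#'} is a comment, blank lines are skipped, and anything after
   * {@code ';'} is ignored. The stream is closed on return.</p>
   *
   * @param in  The UTF-8 encoded data to read. Must not be {@code null}.
   * @param set The set to add members to. Must not be {@code null}.
   * @throws IllegalArgumentException if a line holds a code point that is not valid hexadecimal.
   * @throws IOException if reading from {@code in} fails.
   */
  static void parse(InputStream in, BitSet set) throws IOException {
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        final int hash = line.indexOf('#');
        final String content = (hash < 0 ? line : line.substring(0, hash)).strip();
        if (content.isEmpty()) {
          continue;
        }
        final int semicolon = content.indexOf(';');
        final String codePoints =
            (semicolon < 0 ? content : content.substring(0, semicolon)).strip();
        try {
          final int dots = codePoints.indexOf("..");
          if (dots < 0) {
            set.set(Integer.parseInt(codePoints, 16));
          } else {
            final int start = Integer.parseInt(codePoints.substring(0, dots), 16);
            final int end = Integer.parseInt(codePoints.substring(dots + 2), 16);
            set.set(start, end + 1);
          }
        } catch (NumberFormatException e) {
          throw new IllegalArgumentException("Malformed Extended_Pictographic data: " + content, e);
        }
      }
    }
  }

  /** Demo-only helper (an assumption, not in the PR): parses an in-memory string. */
  static BitSet parseText(String data) {
    final BitSet set = new BitSet();
    try {
      parse(new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8)), set);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return set;
  }

  public static void main(String[] args) {
    System.out.println(parseText("00A9 ; Extended_Pictographic # copyright\n1F600..1F602\n"));
  }
}
```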



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java:
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+/**
+ * Looks up the Unicode {@link WordBreak Word_Break} property of a code point.
+ *
+ * <p>The data is loaded once from the {@code WordBreakProperty.txt} resource of the Unicode
+ * Character Database (parsed with simple cursor scanning, no regular expression). Lookup is O(1)
+ * for the Basic Multilingual Plane (a direct array index) and O(log n) for supplementary code
+ * points (a binary search over a small sorted range table), so it imposes no per-character
+ * allocation on the word boundary algorithm.</p>
+ */
+public final class WordBreakProperty {
+
+  private static final String RESOURCE = "WordBreakProperty.txt";
+
+  private static final WordBreak[] VALUES = WordBreak.values();
+
+  private static volatile Data data;
+
+  private WordBreakProperty() {
+  }
+
+  // Immutable Word_Break tables: ordinal per BMP code point, plus supplementary ranges sorted by
+  // start for binary search. Package-visible so a caller that looks up many code points in one pass
+  // (WordSegmenter) can resolve this once and reuse it, instead of paying the volatile read behind
+  // data() on every call.
+  static final class Data {
+    final byte[] bmp;
+    final int[] supplementaryStart;
+    final int[] supplementaryEnd;
+    final byte[] supplementaryValue;
+
+    Data(byte[] bmp, int[] start, int[] end, byte[] value) {
+      this.bmp = bmp;
+      this.supplementaryStart = start;
+      this.supplementaryEnd = end;
+      this.supplementaryValue = value;
+    }
+  }
+
+  // Package-visible so a per-pass caller can resolve the table once (see the ordinalOf/of overloads
+  // that take a resolved Data) rather than once per code point.
+  static Data data() {

Review Comment:
   Please add proper Javadoc here and mention transitive exceptions that might 
occur, namely: `IllegalStateException` and `UncheckedIOException`.
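
   As a sketch of what such Javadoc could look like on a lazily initialized accessor: the class `LazyTableSketch`, its empty `Data` type, and the counting `load()` stand-in are hypothetical. Only the double-checked locking shape is taken from the hunk above.

```java
/** Hypothetical stand-in for a lazily loaded, shared lookup table; not the PR's class. */
final class LazyTableSketch {

  /** Placeholder for the immutable tables. */
  static final class Data {
  }

  private static volatile Data data;
  private static int loadCount; // demo only, guarded by the class lock

  private LazyTableSketch() {
  }

  /**
   * Returns the shared table, loading it from the bundled resource on first use.
   *
   * <p>After a successful load, later calls cost a single volatile read. A caller that looks
   * up many code points in one pass can hold on to the returned instance.</p>
   *
   * @return The resolved table; never {@code null}.
   * @throws IllegalStateException if the bundled data resource is missing.
   * @throws java.io.UncheckedIOException if the bundled data resource cannot be read.
   */
  static Data data() {
    Data d = data;
    if (d == null) {
      synchronized (LazyTableSketch.class) {
        d = data;
        if (d == null) {
          d = load();
          data = d;
        }
      }
    }
    return d;
  }

  /** Stand-in loader: the real one would read and parse the resource, and may throw. */
  private static Data load() {
    loadCount++;
    return new Data();
  }

  /** Demo-only accessor for the number of load() calls. */
  static int loadCount() {
    synchronized (LazyTableSketch.class) {
      return loadCount;
    }
  }

  public static void main(String[] args) {
    System.out.println(data() == data());
    System.out.println(loadCount());
  }
}
```

   Listing the unchecked exceptions with `@throws` documents them without changing the method signature.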



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX #29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link #tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */

Review Comment:
   Please adjust to:
   
   /** The default maximum token length: 255 */



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java:
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+/**
+ * Looks up the Unicode {@link WordBreak Word_Break} property of a code point.
+ *
+ * <p>The data is loaded once from the {@code WordBreakProperty.txt} resource of the Unicode
+ * Character Database (parsed with simple cursor scanning, no regular expression). Lookup is O(1)
+ * for the Basic Multilingual Plane (a direct array index) and O(log n) for supplementary code
+ * points (a binary search over a small sorted range table), so it imposes no per-character
+ * allocation on the word boundary algorithm.</p>
+ */
+public final class WordBreakProperty {
+
+  private static final String RESOURCE = "WordBreakProperty.txt";
+
+  private static final WordBreak[] VALUES = WordBreak.values();
+
+  private static volatile Data data;
+
+  private WordBreakProperty() {
+  }
+
+  // Immutable Word_Break tables: ordinal per BMP code point, plus supplementary ranges sorted by
+  // start for binary search. Package-visible so a caller that looks up many code points in one pass
+  // (WordSegmenter) can resolve this once and reuse it, instead of paying the volatile read behind
+  // data() on every call.
+  static final class Data {
+    final byte[] bmp;
+    final int[] supplementaryStart;
+    final int[] supplementaryEnd;
+    final byte[] supplementaryValue;
+
+    Data(byte[] bmp, int[] start, int[] end, byte[] value) {
+      this.bmp = bmp;
+      this.supplementaryStart = start;
+      this.supplementaryEnd = end;
+      this.supplementaryValue = value;
+    }
+  }
+
+  // Package-visible so a per-pass caller can resolve the table once (see the ordinalOf/of overloads
+  // that take a resolved Data) rather than once per code point.
+  static Data data() {
+    Data d = data;
+    if (d == null) {
+      synchronized (WordBreakProperty.class) {
+        d = data;
+        if (d == null) {
+          d = load();
+          data = d;
+        }
+      }
+    }
+    return d;
+  }
+
+  private static Data load() {
+    final byte[] bmp = new byte[0x10000];
+    final List<int[]> supplementary = new ArrayList<>();
+    try (InputStream in = WordBreakProperty.class.getResourceAsStream(RESOURCE)) {
+      if (in == null) {
+        throw new IllegalStateException("Missing Word_Break data resource: " + RESOURCE);
+      }
+      parse(in, bmp, supplementary);
+    } catch (IOException e) {
+      throw new UncheckedIOException("Unable to read Word_Break data resource " + RESOURCE, e);
+    }
+    supplementary.sort((a, b) -> Integer.compare(a[0], b[0]));
+    final int[] start = new int[supplementary.size()];
+    final int[] end = new int[supplementary.size()];
+    final byte[] value = new byte[supplementary.size()];
+    for (int i = 0; i < supplementary.size(); i++) {
+      final int[] range = supplementary.get(i);
+      start[i] = range[0];
+      end[i] = range[1];
+      value[i] = (byte) range[2];
+    }
+    return new Data(bmp, start, end, value);
+  }
+
+  // Package-visible so the malformed-data handling can be exercised without the bundled resource.
+  static void parse(InputStream in, byte[] bmp, List<int[]> supplementary) throws IOException {

Review Comment:
   Please add proper Javadoc here and mention exceptions that might occur, 
namely: `IllegalStateException` and `IOException`.
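
   Going by the method body quoted elsewhere in this review, the `Integer.parseInt` calls are not wrapped, so a non-hexadecimal code point would surface as a `NumberFormatException`, which may be worth a third `@throws`. The sketch below isolates that line-level behaviour; the class `WordBreakLineSketch` and its `range` helper are hypothetical and only mirror the quoted checks.

```java
import java.util.Arrays;

/** Hypothetical helper that isolates the line-level checks of the quoted parse(...). */
final class WordBreakLineSketch {

  private WordBreakLineSketch() {
  }

  /**
   * Extracts the inclusive code point range from one comment-stripped, non-empty data line.
   *
   * @param content A line such as {@code "0041..005A ; ALetter"} or {@code "000D ; CR"}.
   * @return A two-element array holding the inclusive start and end of the range.
   * @throws IllegalStateException if {@code content} has no {@code ';'} separator.
   * @throws NumberFormatException if a code point is not valid hexadecimal. This is a
   *         subclass of {@link IllegalArgumentException}.
   */
  static int[] range(String content) {
    final int semicolon = content.indexOf(';');
    if (semicolon < 0) {
      throw new IllegalStateException("Malformed Word_Break data (no ';'): " + content);
    }
    final String codePoints = content.substring(0, semicolon).strip();
    final int dots = codePoints.indexOf("..");
    if (dots < 0) {
      // Unguarded parseInt, as in the quoted code: a bad value throws NumberFormatException.
      final int single = Integer.parseInt(codePoints, 16);
      return new int[] {single, single};
    }
    return new int[] {
        Integer.parseInt(codePoints.substring(0, dots), 16),
        Integer.parseInt(codePoints.substring(dots + 2), 16)
    };
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(range("0041..005A ; ALetter")));
  }
}
```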



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordSegmenter.java:
##########
@@ -0,0 +1,416 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.BitSet;
+import java.util.List;
+
+import opennlp.tools.util.Span;
+
+/**
+ * Finds word boundaries in text using the Unicode Text Segmentation algorithm
+ * (<a href="https://www.unicode.org/reports/tr29/">UAX #29</a>), rules WB1 through WB999.
+ *
+ * <p>The implementation is a single forward cursor pass with O(1) {@link WordBreakProperty}
+ * lookups and no regular expression. It decodes each code point once, keeps only a constant amount
+ * of state, and allocates nothing per character. It implements the "ignore" semantics of WB4 (a
+ * base character absorbs following {@code Extend}, {@code Format}, and {@code ZWJ}), the look-ahead
+ * rules WB6/WB7/WB7b/WB12, the Hebrew quote rules WB7a-WB7c, the emoji zero-width-joiner rule WB3c,
+ * and regional-indicator pairing WB15/WB16. The look-ahead for the WB6/WB7b/WB12 rules is resolved
+ * lazily and only at mid-word punctuation, so the common case never scans ahead.</p>
+ *
+ * <p>{@link #forEachSegment(CharSequence, SegmentConsumer)} streams the segments with no
+ * allocation; {@link #boundaries(CharSequence)} returns every boundary offset (always including
+ * {@code 0} and the text length); {@link #segments(CharSequence)} returns the spans between
+ * them.</p>
+ */
+public final class WordSegmenter {
+
+  /** Receives each word segment as the half-open character range {@code [start, end)}. */
+  @FunctionalInterface
+  public interface SegmentConsumer {
+    /**
+     * Accepts one segment.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     */
+    void accept(int start, int end);
+  }
+
+  // Decisions for the WB5-WB999 rules. NO_BREAK/BREAK are final; CONSULT marks a (last, current)
+  // pair whose decision also depends on look-ahead or regional-indicator parity, so the full rule
+  // cascade must be consulted. GO_SLOW appears only in the FAST table (never in TRANSITION) and
+  // marks a current class that can trigger a WB3-family or WB4 rule.
+  private static final byte NO_BREAK = 0;
+  private static final byte BREAK = 1;
+  private static final byte CONSULT = 2;
+  private static final byte GO_SLOW = 3;
+
+  private static final WordBreak[] CLASSES = WordBreak.values();
+  private static final int CLASS_COUNT = CLASSES.length;
+
+  // TRANSITION[last * CLASS_COUNT + current] holds the WB5-WB999 decision for a (last, current)
+  // pair: NO_BREAK or BREAK when the decision is the same for every secondLast, next significant
+  // value, and parity, or CONSULT otherwise. The table is derived from afterPrefix(...) at
+  // class-load, so it is equivalent to the rule cascade by construction; only the hot path reads it.
+  private static final byte[] TRANSITION = buildTransitionTable();
+
+  // Ordinals of the Word_Break classes that the WB3 family and WB4 examine. The hot loop works with
+  // ordinals to avoid materializing a WordBreak enum per character.
+  private static final int OTHER_ORDINAL = WordBreak.OTHER.ordinal();
+  private static final int CR_ORDINAL = WordBreak.CR.ordinal();
+  private static final int LF_ORDINAL = WordBreak.LF.ordinal();
+  private static final int NEWLINE_ORDINAL = WordBreak.NEWLINE.ordinal();
+  private static final int ZWJ_ORDINAL = WordBreak.ZWJ.ordinal();
+  private static final int WSEG_SPACE_ORDINAL = WordBreak.WSEG_SPACE.ordinal();
+  private static final int EXTEND_ORDINAL = WordBreak.EXTEND.ordinal();
+  private static final int FORMAT_ORDINAL = WordBreak.FORMAT.ordinal();
+  private static final int REGIONAL_INDICATOR_ORDINAL = WordBreak.REGIONAL_INDICATOR.ordinal();
+
+  // SPECIAL[ordinal] is true for the classes that can trigger a WB3-family or WB4 rule (the
+  // newline, ZWJ, word-segment-space, and ignorable classes). When neither the previous nor the
+  // current class is special, those rules cannot fire and the hot loop goes straight to the
+  // transition table.
+  private static final boolean[] SPECIAL = buildSpecialTable();
+
+  // FAST[last * CLASS_COUNT + current] is the hot-loop table: the TRANSITION decision when the
+  // current class is ordinary, or GO_SLOW when it is special. One read decides the common case and
+  // detects a special current class, so the loop never reloads SPECIAL[current].
+  private static final byte[] FAST = buildFastTable();
+
+  private WordSegmenter() {
+  }
+
+  private static boolean[] buildSpecialTable() {
+    final boolean[] special = new boolean[CLASS_COUNT];
+    special[CR_ORDINAL] = true;
+    special[LF_ORDINAL] = true;
+    special[NEWLINE_ORDINAL] = true;
+    special[ZWJ_ORDINAL] = true;
+    special[WSEG_SPACE_ORDINAL] = true;
+    special[EXTEND_ORDINAL] = true;
+    special[FORMAT_ORDINAL] = true;
+    return special;
+  }
+
+  private static byte[] buildFastTable() {
+    final byte[] fast = new byte[CLASS_COUNT * CLASS_COUNT];
+    for (int last = 0; last < CLASS_COUNT; last++) {
+      for (int current = 0; current < CLASS_COUNT; current++) {
+        final int index = last * CLASS_COUNT + current;
+        fast[index] = SPECIAL[current] ? GO_SLOW : TRANSITION[index];
+      }
+    }
+    return fast;
+  }
+
+  private static byte[] buildTransitionTable() {
+    final byte[] table = new byte[CLASS_COUNT * CLASS_COUNT];
+    for (final WordBreak last : CLASSES) {
+      for (final WordBreak current : CLASSES) {
+        table[last.ordinal() * CLASS_COUNT + current.ordinal()] = deriveDecision(last, current);
+      }
+    }
+    return table;
+  }
+
+  // Returns the constant WB5-WB999 decision for a (last, current) pair, or CONSULT if afterPrefix
+  // gives different answers for different secondLast, next, or parity values.
+  private static byte deriveDecision(WordBreak last, WordBreak current) {
+    Boolean constant = null;
+    for (final WordBreak secondLast : CLASSES) {
+      for (final WordBreak next : CLASSES) {
+        for (int parity = 0; parity <= 1; parity++) {
+          final boolean decision = afterPrefix(current, last, secondLast, next, parity);
+          if (constant == null) {
+            constant = decision;
+          } else if (constant != decision) {
+            return CONSULT;
+          }
+        }
+      }
+    }
+    return constant ? BREAK : NO_BREAK;
+  }
+
+  /**
+   * Streams the word segments of {@code text} to {@code consumer} in order, allocating nothing.
+   * Each segment is delivered as the half-open character range {@code [start, end)}; the segments
+   * are contiguous and together cover the whole text.
+   *
+   * @param text     The text to segment.
+   * @param consumer The receiver of the segment ranges.
+   */
+  public static void forEachSegment(CharSequence text, SegmentConsumer consumer) {
+    final int length = text.length();
+    if (length == 0) {
+      return;

Review Comment:
   Please mention this early return for empty input in the Javadoc for the parameter `text`.
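
   One possible wording, shown on a hypothetical stand-in. `EmptyTextContractSketch` is not the PR's class, and its body reports the whole text as a single segment instead of applying the UAX #29 rules; only the length check is taken from the hunk above.

```java
/** Hypothetical stand-in; the real segmenter applies the UAX #29 word boundary rules. */
final class EmptyTextContractSketch {

  /** Receives each segment as the half-open character range {@code [start, end)}. */
  @FunctionalInterface
  interface SegmentConsumer {
    void accept(int start, int end);
  }

  private EmptyTextContractSketch() {
  }

  /**
   * Streams the segments of {@code text} to {@code consumer} in order.
   *
   * @param text     The text to segment. Must not be {@code null}. If it is empty, the method
   *                 returns immediately and {@code consumer} is never invoked.
   * @param consumer The receiver of the segment ranges.
   */
  static void forEachSegment(CharSequence text, SegmentConsumer consumer) {
    final int length = text.length();
    if (length == 0) {
      return;
    }
    // Stand-in for the boundary rules (an assumption): report the whole text as one segment.
    consumer.accept(0, length);
  }

  public static void main(String[] args) {
    forEachSegment("", (start, end) -> System.out.println("never printed"));
    forEachSegment("abc", (start, end) -> System.out.println(start + ".." + end));
  }
}
```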



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java:
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UncheckedIOException;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+/**
+ * Looks up the Unicode {@link WordBreak Word_Break} property of a code point.
+ *
+ * <p>The data is loaded once from the {@code WordBreakProperty.txt} resource of the Unicode
+ * Character Database (parsed with simple cursor scanning, no regular expression). Lookup is O(1)
+ * for the Basic Multilingual Plane (a direct array index) and O(log n) for supplementary code
+ * points (a binary search over a small sorted range table), so it imposes no per-character
+ * allocation on the word boundary algorithm.</p>
+ */
+public final class WordBreakProperty {
+
+  private static final String RESOURCE = "WordBreakProperty.txt";
+
+  private static final WordBreak[] VALUES = WordBreak.values();
+
+  private static volatile Data data;
+
+  private WordBreakProperty() {
+  }
+
+  // Immutable Word_Break tables: ordinal per BMP code point, plus supplementary ranges sorted by
+  // start for binary search. Package-visible so a caller that looks up many code points in one pass
+  // (WordSegmenter) can resolve this once and reuse it, instead of paying the volatile read behind
+  // data() on every call.
+  static final class Data {
+    final byte[] bmp;
+    final int[] supplementaryStart;
+    final int[] supplementaryEnd;
+    final byte[] supplementaryValue;
+
+    Data(byte[] bmp, int[] start, int[] end, byte[] value) {
+      this.bmp = bmp;
+      this.supplementaryStart = start;
+      this.supplementaryEnd = end;
+      this.supplementaryValue = value;
+    }
+  }
+
+  // Package-visible so a per-pass caller can resolve the table once (see the ordinalOf/of overloads
+  // that take a resolved Data) rather than once per code point.
+  static Data data() {
+    Data d = data;
+    if (d == null) {
+      synchronized (WordBreakProperty.class) {
+        d = data;
+        if (d == null) {
+          d = load();
+          data = d;
+        }
+      }
+    }
+    return d;
+  }
+
+  private static Data load() {
+    final byte[] bmp = new byte[0x10000];
+    final List<int[]> supplementary = new ArrayList<>();
+    try (InputStream in = WordBreakProperty.class.getResourceAsStream(RESOURCE)) {
+      if (in == null) {
+        throw new IllegalStateException("Missing Word_Break data resource: " + RESOURCE);
+      }
+      parse(in, bmp, supplementary);
+    } catch (IOException e) {
+      throw new UncheckedIOException("Unable to read Word_Break data resource " + RESOURCE, e);
+    }
+    supplementary.sort((a, b) -> Integer.compare(a[0], b[0]));
+    final int[] start = new int[supplementary.size()];
+    final int[] end = new int[supplementary.size()];
+    final byte[] value = new byte[supplementary.size()];
+    for (int i = 0; i < supplementary.size(); i++) {
+      final int[] range = supplementary.get(i);
+      start[i] = range[0];
+      end[i] = range[1];
+      value[i] = (byte) range[2];
+    }
+    return new Data(bmp, start, end, value);
+  }
+
+  // Package-visible so the malformed-data handling can be exercised without the bundled resource.
+  static void parse(InputStream in, byte[] bmp, List<int[]> supplementary) throws IOException {
+    try (BufferedReader reader =
+             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
+      String line;
+      while ((line = reader.readLine()) != null) {
+        final int hash = line.indexOf('#');
+        final String content = (hash < 0 ? line : line.substring(0, hash)).strip();
+        if (content.isEmpty()) {
+          continue;
+        }
+        final int semicolon = content.indexOf(';');
+        if (semicolon < 0) {
+          // A present-but-structurally-wrong line (no ';' to split code points from the value) is a
+          // hard error naming the line, mirroring ExtendedPictographic, not an opaque substring throw.
+          throw new IllegalStateException(
+              "Malformed Word_Break data in " + RESOURCE + " (no ';'): " + content);
+        }
+        final String codePoints = content.substring(0, semicolon).strip();
+        final String value = content.substring(semicolon + 1).strip();
+        final byte ordinal = (byte) WordBreak.fromPropertyName(value).ordinal();
+
+        final int dots = codePoints.indexOf("..");
+        final int start;
+        final int end;
+        if (dots < 0) {
+          start = Integer.parseInt(codePoints, 16);
+          end = start;
+        } else {
+          start = Integer.parseInt(codePoints.substring(0, dots), 16);
+          end = Integer.parseInt(codePoints.substring(dots + 2), 16);
+        }
+        assign(start, end, ordinal, bmp, supplementary);
+      }
+    }
+  }
+
+  private static void assign(int start, int end, byte ordinal, byte[] bmp, List<int[]> supplementary) {

Review Comment:
   Please add proper Javadoc here. Provide information on illegal parameter
   values, that is, the lower and upper bounds of `start` and `end`.
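
   To make the request concrete, a sketch under loud assumptions: the body of `assign` is not part of the quoted hunk, so both the explicit bounds check and the split at `U+FFFF` below are guesses. They are based only on the `byte[0x10000]` BMP table and the `{start, end, value}` triples that `load()` reads back. The class name `AssignSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in; the PR's actual assign(...) body is not shown in this thread. */
final class AssignSketch {

  private AssignSketch() {
  }

  /**
   * Records {@code ordinal} for every code point in the inclusive range {@code [start, end]}.
   *
   * @param start         The first code point, inclusive. Must satisfy {@code 0 <= start <= end}.
   * @param end           The last code point, inclusive. Must not exceed {@code U+10FFFF}.
   * @param ordinal       The property value ordinal to record.
   * @param bmp           The table indexed by BMP code point, of length {@code 0x10000}.
   * @param supplementary The list receiving {@code {start, end, ordinal}} triples for the part
   *                      of the range above {@code U+FFFF}.
   * @throws IllegalArgumentException if the range is inverted or out of bounds.
   */
  static void assign(int start, int end, byte ordinal, byte[] bmp, List<int[]> supplementary) {
    // Assumed validation: reject anything outside [0, U+10FFFF] or with start > end.
    if (start < 0 || end > Character.MAX_CODE_POINT || start > end) {
      throw new IllegalArgumentException(
          "Illegal code point range: " + Integer.toHexString(start) + ".." + Integer.toHexString(end));
    }
    // BMP part: one table slot per code point.
    final int bmpEnd = Math.min(end, 0xFFFF);
    for (int codePoint = start; codePoint <= bmpEnd; codePoint++) {
      bmp[codePoint] = ordinal;
    }
    // Supplementary part: kept as a range for the later binary search.
    if (end > 0xFFFF) {
      supplementary.add(new int[] {Math.max(start, 0x10000), end, ordinal});
    }
  }

  public static void main(String[] args) {
    final byte[] bmp = new byte[0x10000];
    final List<int[]> supplementary = new ArrayList<>();
    assign(0xFFF0, 0x10010, (byte) 3, bmp, supplementary);
    System.out.println(bmp[0xFFFF] + " " + supplementary.size());
  }
}
```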



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX #29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link #tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this 
tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of 
AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add 
unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no 
allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH 
default} maximum token
+   * length.
+   */
+  public WordTokenizer() {
+    this(DEFAULT_MAX_TOKEN_LENGTH);
+  }
+
+  /**
+   * Creates a tokenizer with the given maximum token length.
+   *
+   * @param maxTokenLength The maximum number of characters in a token; longer 
tokens are chopped
+   *                       into consecutive pieces. Must be at least {@code 1}.
+   * @throws IllegalArgumentException if {@code maxTokenLength} is less than 
{@code 1}.
+   */
+  public WordTokenizer(int maxTokenLength) {
+    if (maxTokenLength < 1) {
+      throw new IllegalArgumentException("maxTokenLength must be at least 1, got " + maxTokenLength);
+    }
+    this.maxTokenLength = maxTokenLength;
+  }
+
+  /**
+   * Streams the word tokens of {@code text} to {@code handler} in order, 
allocating nothing.
+   *
+   * @param text    The text to tokenize.
+   * @param handler The receiver of the tokens.
+   */
+  public void tokenize(CharSequence text, TokenHandler handler) {
+    WordSegmenter.forEachSegment(text, (start, end) -> {

Review Comment:
   Add a parameter check for `text` and `handler`; neither should be null.
   As a consequence, the Javadoc needs a `@throws` entry for the resulting 
`IllegalArgumentException`, which should be thrown if a parameter is invalid.

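   A minimal sketch of what such a check could look like. `NullCheckSketch`, 
its simplified two-argument `TokenHandler`, and the whitespace split are 
hypothetical stand-ins so the snippet compiles on its own; they are not the 
PR's `WordTokenizer` or `WordSegmenter`. Only the two guards and the `@throws` 
line are the point.

```java
// Hypothetical stand-in for WordTokenizer; only the argument checks matter here.
final class NullCheckSketch {

  /** Simplified stand-in for WordTokenizer.TokenHandler (no WordType argument). */
  @FunctionalInterface
  interface TokenHandler {
    void token(int start, int end);
  }

  /**
   * Streams the whitespace-separated tokens of {@code text} to {@code handler}.
   *
   * @param text    The text to tokenize. Must not be {@code null}.
   * @param handler The receiver of the tokens. Must not be {@code null}.
   * @throws IllegalArgumentException Thrown if {@code text} or {@code handler}
   *                                  is {@code null}.
   */
  static void tokenize(CharSequence text, TokenHandler handler) {
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
    if (handler == null) {
      throw new IllegalArgumentException("handler must not be null");
    }
    // Placeholder segmentation: a plain whitespace split, assumed here in place
    // of WordSegmenter.forEachSegment so the sketch stays self-contained.
    int start = -1;
    for (int i = 0; i <= text.length(); i++) {
      final boolean boundary =
          i == text.length() || Character.isWhitespace(text.charAt(i));
      if (!boundary && start < 0) {
        start = i;            // first character of a new token
      } else if (boundary && start >= 0) {
        handler.token(start, i);
        start = -1;           // token closed, wait for the next one
      }
    }
  }
}
```

   Checking both arguments before the first segment is visited means a null 
`handler` is rejected even for input that would produce no tokens at all.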


##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX 
#29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, 
ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and 
punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code 
Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and 
double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive 
pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original 
text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the 
token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link 
#tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, 
TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and 
thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this 
tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of 
AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add 
unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no 
allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH 
default} maximum token

Review Comment:
   Use "Instantiates" instead of "Creates" in the first sentence of the 
constructor's Javadoc.



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX 
#29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, 
ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and 
punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code 
Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and 
double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive 
pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original 
text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the 
token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link 
#tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, 
TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and 
thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this 
tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of 
AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add 
unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no 
allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH 
default} maximum token
+   * length.
+   */
+  public WordTokenizer() {
+    this(DEFAULT_MAX_TOKEN_LENGTH);
+  }
+
+  /**
+   * Creates a tokenizer with the given maximum token length.

Review Comment:
   Use "Instantiates" instead of "Creates" in the first sentence of the 
constructor's Javadoc.



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX 
#29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, 
ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and 
punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code 
Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and 
double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive 
pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original 
text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the 
token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link 
#tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, 
TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and 
thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this 
tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of 
AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add 
unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no 
allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH 
default} maximum token
+   * length.
+   */
+  public WordTokenizer() {
+    this(DEFAULT_MAX_TOKEN_LENGTH);
+  }
+
+  /**
+   * Creates a tokenizer with the given maximum token length.
+   *
+   * @param maxTokenLength The maximum number of characters in a token; longer 
tokens are chopped
+   *                       into consecutive pieces. Must be at least {@code 1}.
+   * @throws IllegalArgumentException if {@code maxTokenLength} is less than 
{@code 1}.

Review Comment:
   Please adjust to:
   
   "Thrown if {@code maxTokenLength} is equal to or less than {@code 0}."



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX 
#29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, 
ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and 
punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code 
Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and 
double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive 
pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original 
text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the 
token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link 
#tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, 
TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and 
thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this 
tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of 
AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add 
unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no 
allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH 
default} maximum token
+   * length.
+   */
+  public WordTokenizer() {
+    this(DEFAULT_MAX_TOKEN_LENGTH);
+  }
+
+  /**
+   * Creates a tokenizer with the given maximum token length.
+   *
+   * @param maxTokenLength The maximum number of characters in a token; longer 
tokens are chopped
+   *                       into consecutive pieces. Must be at least {@code 1}.
+   * @throws IllegalArgumentException if {@code maxTokenLength} is less than 
{@code 1}.
+   */
+  public WordTokenizer(int maxTokenLength) {
+    if (maxTokenLength < 1) {
+      throw new IllegalArgumentException("maxTokenLength must be at least 1, got " + maxTokenLength);
+    }
+    this.maxTokenLength = maxTokenLength;
+  }
+
+  /**
+   * Streams the word tokens of {@code text} to {@code handler} in order, 
allocating nothing.
+   *
+   * @param text    The text to tokenize.
+   * @param handler The receiver of the tokens.
+   */
+  public void tokenize(CharSequence text, TokenHandler handler) {
+    WordSegmenter.forEachSegment(text, (start, end) -> {
+      final WordType type = WordType.of(text, start, end);
+      if (type != null) {
+        emit(text, start, end, type, handler);
+      }
+    });
+  }
+
+  /**
+   * Returns the word tokens of {@code s} as strings, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token strings.
+   */
+  @Override
+  public String[] tokenize(String s) {
+    final List<String> tokens = new ArrayList<>();
+    tokenize(s, (start, end, type) -> tokens.add(s.substring(start, end)));
+    return tokens.toArray(new String[0]);
+  }
+
+  /**
+   * Returns the offset spans of the word tokens of {@code s}, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token spans.
+   */
+  @Override
+  public Span[] tokenizePos(String s) {

Review Comment:
   Rename the parameter `s` to `text` here.
   
   Add a parameter check for `text`: it should not be null, and if it is empty 
we can return early.
   As a consequence, the Javadoc needs a `@throws` entry for the resulting 
`IllegalArgumentException`, which should be thrown if the parameter is invalid.

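   A minimal sketch of the requested shape, under stated assumptions: 
`TokenizePosSketch` and its nested `Span` are hypothetical stand-ins for 
`WordTokenizer` and `opennlp.tools.util.Span`, and the whitespace split is a 
placeholder for the UAX #29 segmentation. It shows the renamed parameter, the 
null guard, and the early return for empty input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for WordTokenizer#tokenizePos; not the PR's code.
final class TokenizePosSketch {

  /** Minimal stand-in for opennlp.tools.util.Span: a half-open [start, end) range. */
  static final class Span {
    final int start;
    final int end;

    Span(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Computes the offset spans of the tokens of {@code text}, in order.
   *
   * @param text The text to tokenize. Must not be {@code null}.
   * @return The token spans; empty if {@code text} is empty.
   * @throws IllegalArgumentException Thrown if {@code text} is {@code null}.
   */
  static Span[] tokenizePos(String text) {
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
    if (text.isEmpty()) {
      return new Span[0];     // nothing to segment, skip the list allocation
    }
    final List<Span> spans = new ArrayList<>();
    // Placeholder segmentation: whitespace split instead of the UAX #29 rules.
    int start = -1;
    for (int i = 0; i <= text.length(); i++) {
      final boolean boundary =
          i == text.length() || Character.isWhitespace(text.charAt(i));
      if (!boundary && start < 0) {
        start = i;
      } else if (boundary && start >= 0) {
        spans.add(new Span(start, i));
        start = -1;
      }
    }
    return spans.toArray(new Span[0]);
  }
}
```

   Returning an empty array rather than `null` for empty input keeps callers 
free of a second null check on the result.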


##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX 
#29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, 
ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and 
punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code 
Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and 
double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive 
pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original 
text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the 
token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link 
#tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, 
TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and 
thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this 
tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of 
AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add 
unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no 
allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH 
default} maximum token
+   * length.
+   */
+  public WordTokenizer() {
+    this(DEFAULT_MAX_TOKEN_LENGTH);
+  }
+
+  /**
+   * Creates a tokenizer with the given maximum token length.
+   *
+   * @param maxTokenLength The maximum number of characters in a token; longer 
tokens are chopped
+   *                       into consecutive pieces. Must be at least {@code 1}.
+   * @throws IllegalArgumentException if {@code maxTokenLength} is less than 
{@code 1}.
+   */
+  public WordTokenizer(int maxTokenLength) {
+    if (maxTokenLength < 1) {
+      throw new IllegalArgumentException("maxTokenLength must be at least 1, got " + maxTokenLength);
+    }
+    this.maxTokenLength = maxTokenLength;
+  }
+
+  /**
+   * Streams the word tokens of {@code text} to {@code handler} in order, 
allocating nothing.
+   *
+   * @param text    The text to tokenize.
+   * @param handler The receiver of the tokens.
+   */
+  public void tokenize(CharSequence text, TokenHandler handler) {
+    WordSegmenter.forEachSegment(text, (start, end) -> {
+      final WordType type = WordType.of(text, start, end);
+      if (type != null) {
+        emit(text, start, end, type, handler);
+      }
+    });
+  }
+
+  /**
+   * Returns the word tokens of {@code s} as strings, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token strings.
+   */
+  @Override
+  public String[] tokenize(String s) {
+    final List<String> tokens = new ArrayList<>();
+    tokenize(s, (start, end, type) -> tokens.add(s.substring(start, end)));
+    return tokens.toArray(new String[0]);
+  }
+
+  /**
+   * Returns the offset spans of the word tokens of {@code s}, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token spans.
+   */
+  @Override
+  public Span[] tokenizePos(String s) {
+    final List<Span> spans = tokenizeSpans(s);
+    return spans.toArray(new Span[0]);
+  }
+
+  /**
+   * Returns the offset spans of the word tokens in {@code text}, in order.

Review Comment:
   Replace "Returns" with "Computes" in the first sentence.



##########
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordTokenizerTest.java:
##########
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.List;
+
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+public class WordTokenizerTest {

Review Comment:
   Please also add a new test class that covers `WordType`; it is currently 
missing.
   
   Please also add a new test class that covers `WordBreak`; it is currently 
missing.



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java:
##########
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * A word tokenizer built on the Unicode Text Segmentation algorithm (UAX 
#29). It finds segments
+ * with {@link WordSegmenter}, keeps the ones that are words (letters, digits, 
ideographs, kana,
+ * Hangul, Southeast-Asian script, or emoji), drops whitespace and 
punctuation, and classifies each
+ * kept token with a {@link WordType}. Emoji here means any {@code 
Extended_Pictographic} code point,
+ * so symbol-like characters such as the copyright, trademark, and 
double-exclamation signs are kept
+ * (typed {@link WordType#EMOJI}) rather than dropped as punctuation.
+ *
+ * <p>A token longer than {@code maxTokenLength} is emitted as consecutive 
pieces, never splitting a
+ * surrogate pair. The tokenizer reports offset {@link Span}s, so the original 
text and its character
+ * offsets are preserved for downstream normalization.</p>
+ *
+ * <p>It implements {@link Tokenizer}: {@link #tokenize(String)} returns the 
token strings and
+ * {@link #tokenizePos(String)} their offsets. {@link 
#tokenizeTyped(CharSequence)} additionally
+ * carries each token's {@link WordType}, and {@link #tokenize(CharSequence, 
TokenHandler)} streams
+ * tokens with no per-token allocation. Instances are immutable and 
thread-safe.</p>
+ */
+// Implements Tokenizer directly rather than extending AbstractTokenizer: this 
tokenizer produces
+// its spans from the UAX #29 segmenter in one pass and shares none of 
AbstractTokenizer's
+// per-character probability/merge machinery, so subclassing it would only add 
unused state.
+public final class WordTokenizer implements Tokenizer {
+
+  /** Receives each word token as a character range and its type, with no 
allocation. */
+  @FunctionalInterface
+  public interface TokenHandler {
+    /**
+     * Accepts one word token.
+     *
+     * @param start The inclusive start character offset.
+     * @param end   The exclusive end character offset.
+     * @param type  The token category.
+     */
+    void token(int start, int end, WordType type);
+  }
+
+  /** The default maximum token length. */
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+
+  private final int maxTokenLength;
+
+  /**
+   * Creates a tokenizer with the {@linkplain #DEFAULT_MAX_TOKEN_LENGTH 
default} maximum token
+   * length.
+   */
+  public WordTokenizer() {
+    this(DEFAULT_MAX_TOKEN_LENGTH);
+  }
+
+  /**
+   * Creates a tokenizer with the given maximum token length.
+   *
+   * @param maxTokenLength The maximum number of characters in a token; longer 
tokens are chopped
+   *                       into consecutive pieces. Must be at least {@code 1}.
+   * @throws IllegalArgumentException if {@code maxTokenLength} is less than 
{@code 1}.
+   */
+  public WordTokenizer(int maxTokenLength) {
+    if (maxTokenLength < 1) {
+      throw new IllegalArgumentException("maxTokenLength must be at least 1, got " + maxTokenLength);
+    }
+    this.maxTokenLength = maxTokenLength;
+  }
+
+  /**
+   * Streams the word tokens of {@code text} to {@code handler} in order, 
allocating nothing.
+   *
+   * @param text    The text to tokenize.
+   * @param handler The receiver of the tokens.
+   */
+  public void tokenize(CharSequence text, TokenHandler handler) {
+    WordSegmenter.forEachSegment(text, (start, end) -> {
+      final WordType type = WordType.of(text, start, end);
+      if (type != null) {
+        emit(text, start, end, type, handler);
+      }
+    });
+  }
+
+  /**
+   * Returns the word tokens of {@code s} as strings, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token strings.
+   */
+  @Override
+  public String[] tokenize(String s) {
+    final List<String> tokens = new ArrayList<>();
+    tokenize(s, (start, end, type) -> tokens.add(s.substring(start, end)));
+    return tokens.toArray(new String[0]);
+  }
+
+  /**
+   * Returns the offset spans of the word tokens of {@code s}, in order.
+   *
+   * @param s The text to tokenize.
+   * @return The token spans.
+   */
+  @Override
+  public Span[] tokenizePos(String s) {
+    final List<Span> spans = tokenizeSpans(s);
+    return spans.toArray(new Span[0]);
+  }
+
+  /**
+   * Returns the offset spans of the word tokens in {@code text}, in order.
+   *
+   * @param text The text to tokenize.
+   * @return The word-token spans.
+   */
+  public List<Span> tokenizeSpans(CharSequence text) {
+    final List<Span> spans = new ArrayList<>();
+    tokenize(text, (start, end, type) -> spans.add(new Span(start, end)));
+    return spans;
+  }
+
+  /**
+   * Returns the word tokens in {@code text}, each carrying its {@link 
WordType}, in order.
+   *
+   * @param text The text to tokenize.
+   * @return The typed word tokens.
+   */
+  public List<WordToken> tokenizeTyped(CharSequence text) {

Review Comment:
   Please add a parameter check for `text`: it must not be null, and if it is
   empty we can return early.
   As a consequence, the Javadoc needs a `@throws IllegalArgumentException`
   entry stating that the exception is thrown if parameters are invalid.
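
   A minimal sketch of the requested guard, not the PR's actual code: the class
   `TextGuardSketch` and its whitespace split are hypothetical stand-ins for
   `WordTokenizer` and the real segmentation. Only the null check, the early
   return for empty input, and the `@throws` entry are the point here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for WordTokenizer; it only illustrates the guard.
final class TextGuardSketch {

  /**
   * Returns the tokens of {@code text}, in order.
   *
   * @param text The text to tokenize. Must not be {@code null}.
   * @return The tokens, or an empty list if {@code text} is empty.
   * @throws IllegalArgumentException Thrown if {@code text} is {@code null}.
   */
  static List<String> tokenizeTyped(CharSequence text) {
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
    final List<String> tokens = new ArrayList<>();
    if (text.length() == 0) {
      return tokens; // early return: nothing to segment
    }
    // Placeholder for the real segmentation: split on whitespace.
    for (String part : text.toString().trim().split("\\s+")) {
      if (!part.isEmpty()) {
        tokens.add(part);
      }
    }
    return tokens;
  }
}
```

   The same guard would presumably apply to the other public entry points that
   accept `text`, so that they fail consistently.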



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordType.java:
##########
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.util.BitSet;
+
+/**
+ * The category of a {@linkplain WordTokenizer word token}. {@link #ALPHANUMERIC} and
+ * {@link #NUMERIC} cover letter and digit words; the remaining categories identify scripts and
+ * emoji that benefit from script-specific handling. The boundaries themselves follow the Unicode
+ * release shipped with {@link WordSegmenter}.
+ */
+public enum WordType {
+
+  /** A token that contains at least one letter (optionally mixed with digits and connectors). */
+  ALPHANUMERIC,
+
+  /** A token made up entirely of digits and numeric connectors. */
+  NUMERIC,
+
+  /** A token containing a Han ideograph (one ideograph per token under UAX #29 segmentation). */
+  IDEOGRAPHIC,
+
+  /** A Hiragana token. */
+  HIRAGANA,
+
+  /** A Katakana token. */
+  KATAKANA,
+
+  /** A Hangul token. */
+  HANGUL,
+
+  /** A token in a Southeast Asian script that requires dictionary segmentation (Thai, Lao, ...). */
+  SOUTHEAST_ASIAN,
+
+  /** An emoji, emoji sequence, or regional-indicator flag. */
+  EMOJI;
+
+  private static final int REGIONAL_INDICATOR_FIRST = 0x1F1E6;
+  private static final int REGIONAL_INDICATOR_LAST = 0x1F1FF;
+
+  // No code point below this can belong to a script-specific category (the lowest is Thai, U+0E00),
+  // so Latin, Greek, Cyrillic, and ASCII text skips the relatively costly script lookup entirely.
+  private static final int LOWEST_SCRIPT_CODE_POINT = 0x0E00;
+
+  // ASCII kind: 0 = neither, 1 = letter, 2 = digit. No ASCII code point is pictographic or in a
+  // script-specific category, so ASCII characters skip those tests and the Character.isLetter /
+  // isDigit general-category look-ups entirely.
+  private static final byte[] ASCII_KIND = buildAsciiKind();
+
+  private static byte[] buildAsciiKind() {
+    final byte[] kind = new byte[0x80];
+    for (int c = '0'; c <= '9'; c++) {
+      kind[c] = 2;
+    }
+    for (int c = 'A'; c <= 'Z'; c++) {
+      kind[c] = 1;
+    }
+    for (int c = 'a'; c <= 'z'; c++) {
+      kind[c] = 1;
+    }
+    return kind;
+  }
+
+  // Classifies the code points in text over [start, end) as a word token type, or returns null
+  // when the range is not a word (pure whitespace, punctuation, or symbols). Emoji win over
+  // scripts, scripts over the generic alphanumeric/numeric split. The script category is taken from
+  // the first script code point in the range; UAX #29 word segments are single-script in practice, so
+  // for an unusual mixed-script run this reports the leading script, not a per-character determination.
+  static WordType of(CharSequence text, int start, int end) {

Review Comment:
   Please convert this to a proper Javadoc comment. Also mention any
   transitive exceptions that could be thrown from the method's body.
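
   A rough sketch of what such a Javadoc comment might look like, on a
   simplified classifier. `WordTypeJavadocSketch` and its `Kind` enum are
   hypothetical; the real `WordType.of` also handles emoji and scripts. The
   listed exceptions are the ones `Character.codePointAt(CharSequence, int)`
   can raise, under the assumption that the real method reads code points in a
   similar way.

```java
// Hypothetical, simplified stand-in for WordType; it covers only letters and digits.
final class WordTypeJavadocSketch {

  enum Kind { ALPHANUMERIC, NUMERIC }

  /**
   * Classifies the code points of {@code text} over {@code [start, end)}.
   *
   * @param text  The text holding the segment.
   * @param start The inclusive start offset of the segment.
   * @param end   The exclusive end offset of the segment.
   * @return {@link Kind#ALPHANUMERIC} if the range contains a letter,
   *         {@link Kind#NUMERIC} if it contains digits but no letter, or
   *         {@code null} if the range is not a word.
   * @throws NullPointerException      Thrown if {@code text} is {@code null}
   *                                   and the range is not empty.
   * @throws IndexOutOfBoundsException Thrown if an offset that is read lies
   *                                   outside of {@code text}.
   */
  static Kind of(CharSequence text, int start, int end) {
    boolean sawDigit = false;
    for (int i = start; i < end; ) {
      // May throw NullPointerException or IndexOutOfBoundsException.
      final int cp = Character.codePointAt(text, i);
      if (Character.isLetter(cp)) {
        return Kind.ALPHANUMERIC;
      }
      if (Character.isDigit(cp)) {
        sawDigit = true;
      }
      i += Character.charCount(cp);
    }
    return sawDigit ? Kind.NUMERIC : null;
  }
}
```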



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

