Re: [PR] OPENNLP-1837: Add BertTokenizer with BERT basic tokenization (opennlp)

via GitHub Thu, 11 Jun 2026 03:59:21 -0700


Copilot commented on code in PR #1073:
URL: https://github.com/apache/opennlp/pull/1073#discussion_r3395297835



##########
opennlp-api/src/main/java/opennlp/tools/tokenize/WordpieceTokenizer.java:
##########
@@ -113,10 +118,36 @@ public WordpieceTokenizer(
     this.unknownToken = unknownToken;
   }
 
+  /**
+   * Initializes a {@link WordpieceTokenizer} with a {@code vocabulary},
+   * custom special tokens and a custom {@code maxTokenLength}.
+   *
+   * @param vocabulary          The vocabulary.
+   * @param classificationToken The CLS token.
+   * @param separatorToken      The SEP token.
+   * @param unknownToken        The UNK token.
+   * @param maxTokenLength      A non-negative number that is used as maximum 
token length.
+   */
+  public WordpieceTokenizer(
+      final Set<String> vocabulary,
+      final String classificationToken,
+      final String separatorToken,
+      final String unknownToken,
+      final int maxTokenLength) {
+    this(vocabulary, classificationToken, separatorToken, unknownToken);
+    this.maxTokenLength = maxTokenLength;
+  }

Review Comment:
   The constructor accepts a "non-negative" maxTokenLength (per Javadoc) but 
does not validate it. A negative value will silently cause all tokens to be 
treated as over-length and replaced with the unknown token.



##########
opennlp-api/src/main/java/opennlp/tools/tokenize/BertTokenizer.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.tokenize;
+
+import java.text.Normalizer;
+import java.util.Locale;
+import java.util.Objects;
+import java.util.Set;
+
+import opennlp.tools.util.Span;
+
+/**
+ * A {@link Tokenizer} implementation of the full BERT tokenization pipeline:
+ * basic tokenization (text normalization) followed by wordpiece tokenization.
+ * <p>
+ * The basic tokenization stage reproduces the reference BERT
+ * {@code BasicTokenizer}:
+ * <ol>
+ *  <li>Removal of control characters and normalization of all whitespace
+ *      to single spaces.</li>
+ *  <li>Whitespace isolation of CJK ideographs.</li>
+ *  <li>For <i>uncased</i> models: lower casing and accent stripping
+ *      (Unicode NFD decomposition with removal of combining marks).</li>
+ *  <li>Isolation of every punctuation character as its own token.</li>
+ * </ol>
+ * The normalized text is then split into subwords by a
+ * {@link WordpieceTokenizer} sharing the same vocabulary and special tokens.
+ * <p>
+ * This pipeline is required for correct results with BERT-style models:
+ * feeding raw text directly to {@link WordpieceTokenizer} maps every token
+ * that does not literally appear in the vocabulary - for uncased models that
+ * includes every capitalized word - to the unknown token.
+ * <p>
+ * Whether to use the lower casing variant is a property of the model: uncased
+ * models (for example {@code bert-base-uncased} and the
+ * {@code sentence-transformers} models derived from it) require it, cased
+ * models must not use it.
+ * <p>
+ * For reference see:
+ * <ul>
+ *  <li><a href="https://github.com/google-research/bert";>
+ *    https://github.com/google-research/bert</a> ({@code 
tokenization.py})</li>
+ * </ul>
+ *
+ * @see WordpieceTokenizer
+ */
+public class BertTokenizer implements Tokenizer {
+
+  /**
+   * Maximum characters per word before the word is replaced with the unknown
+   * token, matching the reference BERT implementation.
+   */
+  private static final int MAX_WORD_CHARACTERS = 100;
+
+  private final WordpieceTokenizer wordpieceTokenizer;
+  private final boolean lowerCase;
+
+  /**
+   * Initializes a {@link BertTokenizer} for an <i>uncased</i> BERT model,
+   * with lower casing and accent stripping enabled.
+   *
+   * @param vocabulary The wordpiece vocabulary. Must not be {@code null}.
+   */
+  public BertTokenizer(Set<String> vocabulary) {
+    this(vocabulary, true);
+  }
+
+  /**
+   * Initializes a {@link BertTokenizer} with BERT special tokens.
+   *
+   * @param vocabulary The wordpiece vocabulary. Must not be {@code null}.
+   * @param lowerCase  {@code true} for uncased models (lower casing and accent
+   *                   stripping), {@code false} for cased models.
+   */
+  public BertTokenizer(Set<String> vocabulary, boolean lowerCase) {
+    this(vocabulary, lowerCase, WordpieceTokenizer.BERT_CLS_TOKEN,
+        WordpieceTokenizer.BERT_SEP_TOKEN, WordpieceTokenizer.BERT_UNK_TOKEN);
+  }
+
+  /**
+   * Initializes a {@link BertTokenizer} with custom special tokens, for models
+   * like RoBERTa that do not use the BERT defaults.
+   *
+   * @param vocabulary          The wordpiece vocabulary. Must not be {@code 
null}.
+   * @param lowerCase           {@code true} for uncased models (lower casing 
and
+   *                            accent stripping), {@code false} for cased 
models.
+   * @param classificationToken The CLS token.
+   * @param separatorToken      The SEP token.
+   * @param unknownToken        The UNK token.
+   */
+  public BertTokenizer(Set<String> vocabulary, boolean lowerCase,
+      String classificationToken, String separatorToken, String unknownToken) {
+    Objects.requireNonNull(vocabulary, "vocabulary must not be null");
+    this.wordpieceTokenizer = new WordpieceTokenizer(vocabulary,
+        classificationToken, separatorToken, unknownToken, 
MAX_WORD_CHARACTERS);
+    this.lowerCase = lowerCase;

Review Comment:
   This public constructor only null-checks the vocabulary. If any of the 
special token strings are null, failures will happen later in tokenization in 
less obvious places; validate the inputs at construction time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1837: Add BertTokenizer with BERT basic tokenization (opennlp)

Reply via email to