[PR] OPENNLP-1838: Adopt BertTokenizer in opennlp-dl components (opennlp)

via GitHub Thu, 11 Jun 2026 21:07:49 -0700


krickert opened a new pull request, #1075:
URL: https://github.com/apache/opennlp/pull/1075


   > [!NOTE]
   > **Stacked on #1072, #1073 and #1074.** This branch includes their commits because it depends on `BertTokenizer` from #1073. Once those merge, this PR will be rebased onto `main` and the diff collapses to the OPENNLP-1838 changes only. Draft until then.
   
   ## What
   
   - `SentenceVectorsDL`, `DocumentCategorizerDL` and `NameFinderDL` now tokenize with `BertTokenizer` (basic tokenization / normalization, then wordpiece) instead of the raw `WordpieceTokenizer`.
   - Lower-casing defaults follow each component's commonly used models:
     - `SentenceVectorsDL` and `DocumentCategorizerDL`: **true** (the README-recommended models are uncased)
     - `NameFinderDL`: **false** (recommended NER models such as `dslim/bert-base-NER` are cased; capitalization is a core signal for entity boundaries)
   - The defaults can be overridden via the new tri-state `InferenceOptions.setLowerCase(boolean)` or the new `SentenceVectorsDL(model, vocab, lowerCase)` constructor. RoBERTa-style special-token detection is preserved.
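
To illustrate the default resolution described in the list above, here is a minimal sketch of how a tri-state override could fall back to a per-component default. It is not the actual `InferenceOptions` code: the class `LowerCaseResolution`, the `Component` enum and the `resolve` method are hypothetical names, and it assumes that "unset" is modeled as a `null` `Boolean`.

```java
class LowerCaseResolution {

    // Hypothetical per-component defaults, mirroring the list above.
    enum Component {
        SENTENCE_VECTORS(true),
        DOCUMENT_CATEGORIZER(true),
        NAME_FINDER(false);

        final boolean defaultLowerCase;

        Component(boolean defaultLowerCase) {
            this.defaultLowerCase = defaultLowerCase;
        }
    }

    // Assumption: null means "not set by the caller", so the component default applies.
    static boolean resolve(Boolean override, Component component) {
        return override != null ? override : component.defaultLowerCase;
    }

    public static void main(String[] args) {
        // No override: the cased NER default is used.
        System.out.println(resolve(null, Component.NAME_FINDER));
        // Explicit override wins over the component default.
        System.out.println(resolve(true, Component.NAME_FINDER));
        // No override: the uncased default is used.
        System.out.println(resolve(null, Component.SENTENCE_VECTORS));
    }
}
```

With this shape, an explicit `true` or `false` always wins, and only an unset value picks up the component-specific default.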
   
   ## Why
   
   See [OPENNLP-1838](https://issues.apache.org/jira/browse/OPENNLP-1838) and OPENNLP-1837: without basic tokenization, uncased models receive `[UNK]` for every capitalized or accented word, severely degrading results.
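
The effect can be reproduced with a toy example. The sketch below does not use the OpenNLP tokenizers: `UncasedVocabDemo` is a hypothetical stand-in that uses a four-word uncased vocabulary, a whole-word lookup in place of wordpiece, and a simplified normalization step (lower-casing plus accent stripping) in place of BERT basic tokenization.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class UncasedVocabDemo {

    // Toy uncased vocabulary; a real model ships tens of thousands of entries.
    static final Set<String> VOCAB = Set.of("paris", "is", "nice", "cafe");

    // Whole-word lookup standing in for wordpiece: words missing from the vocabulary become [UNK].
    static List<String> lookup(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            tokens.add(VOCAB.contains(word) ? word : "[UNK]");
        }
        return tokens;
    }

    // Simplified normalization: lower-case, decompose, then drop combining marks (accents).
    static String normalize(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String sentence = "Paris is nice";
        // Without normalization, the capitalized word misses the uncased vocabulary.
        System.out.println(lookup(sentence));
        // With normalization, every word is found.
        System.out.println(lookup(normalize(sentence)));
        // Accented input is folded to its unaccented form before lookup.
        System.out.println(lookup(normalize("Caf\u00e9")));
    }
}
```

In this toy setup the raw lookup yields `[UNK]` for `Paris`, while the normalized input resolves all three words.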
   
   ## Validation
   
   - New `CreateTokenizerTest` covers lower-casing vs. case-preserving tokenizers, RoBERTa special-token selection, and the `InferenceOptions` default resolution.
   - Verified end-to-end with the sentence-transformers all-MiniLM-L6-v2 ONNX export: the existing `SentenceVectorsDLEval` pinned values still hold, and a capitalized variant of the eval sentence now produces vectors identical to the lowercase one (previously every capitalized word mapped to `[UNK]`). This check was added as an eval assertion.
   - Verified with the real `dslim/bert-base-NER` vocabulary that all `NameFinderDLEval` input sentences tokenize identically before and after this change, so the NER eval expectations remain valid.
   
   ## Note for eval-data owners
   
   `DocumentCategorizerDLEval` expectations will shift because lower-casing changes the tokens sent to the uncased sentiment models (that is the intended fix). The pinned values need to be regenerated against the canonical eval-data ONNX files, which are not publicly available. Happy to update the pins if someone can run the eval and share the new values.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

