krickert opened a new pull request, #1075:
URL: https://github.com/apache/opennlp/pull/1075
> [!NOTE]
> **Stacked on #1072, #1073 and #1074.** This branch includes their commits
because it depends on `BertTokenizer` from #1073. Once those merge, this PR
will be rebased onto `main` and the diff collapses to the OPENNLP-1838 changes
only. Draft until then.
## What
- `SentenceVectorsDL`, `DocumentCategorizerDL` and `NameFinderDL` now
tokenize with `BertTokenizer` (basic tokenization / normalization, then
wordpiece) instead of raw `WordpieceTokenizer`.
- Lower casing defaults follow each component's commonly used models:
  - `SentenceVectorsDL` and `DocumentCategorizerDL`: **true** (the
    README-recommended models are uncased)
  - `NameFinderDL`: **false** (recommended NER models such as
    `dslim/bert-base-NER` are cased; capitalization is a core signal for entity
    boundaries)
- Overridable via the new tri-state `InferenceOptions.setLowerCase(boolean)`
or the new `SentenceVectorsDL(model, vocab, lowerCase)` constructor.
RoBERTa-style special-token detection is preserved.
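
One way to read "tri-state" with per-component defaults is sketched below: an unset option falls back to the component's default, while an explicit `true` or `false` wins. This is only an illustration. The class `LowerCaseResolution` and the method `resolve` are hypothetical names, and the actual resolution logic in this PR may be structured differently.

```java
public class LowerCaseResolution {

    // Hypothetical helper, not part of OpenNLP.
    // override == null means "not set by the caller"; otherwise the explicit value wins.
    static boolean resolve(Boolean override, boolean componentDefault) {
        return override != null ? override : componentDefault;
    }

    public static void main(String[] args) {
        // Unset option: an uncased-default component lower-cases, a cased-default one does not.
        System.out.println(resolve(null, true));
        System.out.println(resolve(null, false));
        // Explicit override beats the component default in both directions.
        System.out.println(resolve(Boolean.FALSE, true));
        System.out.println(resolve(Boolean.TRUE, false));
    }
}
```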
## Why
See [OPENNLP-1838](https://issues.apache.org/jira/browse/OPENNLP-1838) and
OPENNLP-1837: without basic tokenization, uncased models receive `[UNK]` for
every capitalized or accented word, severely degrading results.
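
The failure mode can be reproduced with a toy example. The sketch below is not OpenNLP's `BertTokenizer` or `WordpieceTokenizer`: the class `ToyBertTokenization`, its tiny vocabulary, and the whitespace-only word splitting are all assumptions made for illustration. It applies a greedy longest-match wordpiece lookup against an uncased vocabulary, with and without a lower-casing and accent-stripping step in front.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyBertTokenization {

    // Hypothetical, tiny uncased vocabulary (real models ship tens of thousands of entries).
    static final Set<String> VOCAB = Set.of("[UNK]", "a", "cafe", "in", "paris", "##s");

    // Basic normalization: when lowerCase is set, lower-case the word and strip accents.
    static String normalize(String word, boolean lowerCase) {
        if (!lowerCase) {
            return word;
        }
        String lowered = word.toLowerCase(Locale.ROOT);
        // NFD splits "é" into "e" plus a combining mark, which is then removed.
        String decomposed = Normalizer.normalize(lowered, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    // Greedy longest-match-first wordpiece; the whole word becomes [UNK] if any part is unmatched.
    static List<String> wordpiece(String word) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (end > start) {
                String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
                if (VOCAB.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                return List.of("[UNK]");
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    // Split on whitespace, normalize each word, then apply wordpiece.
    static List<String> tokenize(String text, boolean lowerCase) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            out.addAll(wordpiece(normalize(word, lowerCase)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Without normalization, capitalized and accented words miss the uncased vocabulary.
        System.out.println(tokenize("A Café in Paris", false)); // [[UNK], [UNK], in, [UNK]]
        // With normalization, every word resolves to a known piece.
        System.out.println(tokenize("A Café in Paris", true));  // [a, cafe, in, paris]
    }
}
```

In this toy setup only the already-lowercase, unaccented word `in` survives raw wordpiece lookup, which mirrors the degradation described above.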
## Validation
- New `CreateTokenizerTest` covers lower-casing vs. case-preserving
tokenizers, RoBERTa special-token selection, and the `InferenceOptions` default
resolution.
- Verified end-to-end with the sentence-transformers all-MiniLM-L6-v2 ONNX
export: the existing `SentenceVectorsDLEval` pinned values still hold, and a
capitalized variant of the eval sentence now produces vectors identical to the
lowercase one (previously every capitalized word mapped to `[UNK]`). Added as
an eval assertion.
- Verified with the real `dslim/bert-base-NER` vocabulary that all
`NameFinderDLEval` input sentences tokenize identically before and after this
change, so the NER eval expectations remain valid.
## Note for eval-data owners
`DocumentCategorizerDLEval` expectations will shift because lower casing
changes the tokens sent to the uncased sentiment models (that is the intended
fix). The pinned values need to be regenerated against the canonical eval-data
ONNX files, which are not publicly available. Happy to update the pins if
someone can run the eval and share the new values.