krickert commented on PR #1073: URL: https://github.com/apache/opennlp/pull/1073#issuecomment-4680497953
Thanks for the thorough review — all points addressed in the latest commit: **1. Breaking changes:** Agreed on all three. I've added an explicit "As of OpenNLP 3.0.0" paragraph to the `WordpieceTokenizer` class Javadoc listing each behavior change (punctuation-run splitting, whole-word `[UNK]` replacement, `tokenizePos` throwing). Since this branch targets 3.0.0-SNAPSHOT they land in a major release; happy to also tag the JIRA for release notes if that's the convention here. **2. Dependency direction:** Done - the shared character predicates (`isControl`, `isWhitespace`, `isPunctuation`, `isCjk`) and `isolatePunctuation` now live in a package-private `BertNormalization` helper used by both classes, so `WordpieceTokenizer` no longer depends on `BertTokenizer`. **3. `isControl`:** Good catch. I verified against HuggingFace `tokenizers` first (it strips Co/Cn the same way the Python reference does) and widened the check to all `C*` categories (Cc, Cf, Cs, Co, Cn), with a new test covering U+E000 (private use) and U+FDD0 (noncharacter). The full-vocabulary parity harness now passes 13/13, including a sentence with embedded C* characters. **Minor (lowercase/accent coupling):** `lowerCase` intentionally mirrors the reference default coupling (`strip_accents` follows `do_lower_case` unless overridden); this is now noted in the class Javadoc. A decoupled `stripAccents` flag is easy to add later without breaking the constructor surface if a cased+accent-stripped model ever shows up. Also addressed the two Copilot notes: `maxTokenLength` is now validated (`IllegalArgumentException` on negative values) and the `BertTokenizer` constructor null-checks all three special tokens, both with tests. Follow-up JIRA for switching the `opennlp-dl` components to `BertTokenizer` - I'll create the JIRA ticket now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
