Re: [PR] OPENNLP-1850: Improve Whitespace UTF normalization (opennlp)

via GitHub Fri, 19 Jun 2026 17:24:25 -0700


krickert commented on PR #1101:
URL: https://github.com/apache/opennlp/pull/1101#issuecomment-4755649750


   Thanks for the thorough pass.  
   Below I touch on every point. 
   
   It's all pushed now (through `5ab7f873`); and did a first pass.  If you are 
OK with it - I can do 2 smaller PRs on top of this (I think that's easiest - 
but if you want it to be further broken down I can figure it out).
   
   ## Licensing / release plumbing - done
   
   - **NOTICE.template, not just NOTICE.** The Unicode attribution now lives in
     `src/license/NOTICE.template`, so it survives regeneration (same as the 
stopword /
     spellchecker entries). The generated `NOTICE` is updated to match.
   - **rat-excludes.** Added the four bundled `.txt` paths under an 
`OPENNLP-1850` comment
     (`WordBreakProperty.txt`, `ExtendedPictographic.txt`, `confusables.txt`, 
and the
     `WordBreakTest.txt` test fixture).
   - **LICENSE carries the actual text.** Embedded the full Unicode License V3 
in `LICENSE`,
     the way the stopword lists embed their BSD text - since the newer Unicode 
headers only
     link to `terms_of_use.html` rather than inlining it, a URL in NOTICE 
wasn't enough on
     its own.
   - **ExtendedPictographic.txt wording.** Corrected - it's described as 
derived from
     `emoji-data.txt` by keeping only the `Extended_Pictographic` property (451 
lines,
     renamed), not an unmodified copy. The wording also notes the upstream file 
carries the
     other five emoji properties that aren't retained.
   
   ## Architecture
   
   **One token-analysis entry point, one place that defines the steps - done.**
   This wasn't bad at all, it also shrinks the surface:
   
   - `TextAnalyzer` / `AnalyzedToken` are deleted. `TermAnalyzer` / `Term` is 
the single
     entry point - `Term.original()` / `normalized()` / `span()` give you 
exactly what
     `AnalyzedToken` did, on the UAX #29 tokenizer instead of a second one.
   - Each `Dimension` now carries its own default `CharSequenceNormalizer` 
(lazily, so the
     confusables table isn't loaded just by touching the enum). `TermAnalyzer` 
dropped its
     parallel `defaultTransforms()` map, and `TextNormalizer`'s nfc / nfkc / 
whitespace /
     dash / case-fold / accent-fold rungs now read from `Dimension`. The six 
shared rungs are
     defined once. (`TextNormalizer`'s standalone cleanup steps - quotes, 
digits, ellipsis,
     bullets, strip-invisible - I left as-is for now since they already live in 
one place;
     promoting them to `Dimension`s is an easy follow-up if you'd rather 
everything be
     layer-able.  LMK)
   
   **api vs runtime split is completed.** `opennlp-api/util/normalizer` is now 
exactly three files:
   `CharSequenceNormalizer` (the contract) plus the table-free value types 
`NormalizedText`
   and `OffsetMap`. Everything with data or an engine - `CharClass`, 
`CodePointSet`,
   `UnicodeWhitespace`, `UnicodeDash`, the whole normalizer family - moved down 
to
   `opennlp-runtime`. Nothing table-shaped is left in the API. One consequence 
to sanity-check:
   `AbstractDL` uses `CharClass` for input chunking, so `opennlp-dl` now 
compile-depends on
   `opennlp-runtime` (it was test-scope before). That's the minimal fix; if 
you'd rather DL
   stay api-thin, the alternative is to inject the normalizer through 
`InferenceOptions` just LMK.  
   I don't have an opin in either direction for that one.
   
   **DL changes are behavioral.** The `InferenceOptions` flags, 
`normalizeInput`, and
   the span-decode rework are behavioral changes to existing components, not 
purely additive, so
   I'll go ahead and make that a separate review.  There were good reasons for 
this: feeding offset-shifting normalization into the DL models requires the 
span decode to map predictions back to the original text rather than the 
normalized buffer.  Therefore without that rework, 
`NameFinderDL`/`DocumentCategorizerDL` would emit spans pointing at the wrong 
characters whenever normalization changed the input length.  However, the 
output remains unchanged
   
   ## Reviewability splitting it up
   
   Rather than stack onto the in-flight PRs (this is self-contained and would 
just tangle with those), I'd split *this* along its natural dependency seam 
into two stacks:
   
   - **Stack 1 - normalization foundation:** the API contracts + the runtime 
normalizer engine
     (`CharClass`, the normalizer family, `Dimension`, `TextNormalizer`, the 
offset model) + the
     DL opt-ins.
   - **Stack 2 - UAX #29 tokenizer + advanced:** the tokenizer and its 
boundary/perf engine (where
     the lookup tables live), the layered `Term` model, confusable folding (UTS 
#39), and the
     per-language profiles
   
   That contains the perf/lookup-table and DL-behavior conversations entirely 
in Stack 2 and lets
   the foundation land clean. Your six-way split would also work; I leaned to 
two because the seam
   between foundation and tokenizer is the one real dependency boundary, and 
the rest are more
   "commits within a stack" than independently-landable units.
   
   ## The speed framing (for context, not a competition)
   
   Numbers from JMH harness (2 forks, warmed up), vs **Lucene 10.3.2**
   `StandardTokenizer` and today's `SimpleTokenizer`, Latin corpus (Mchars/s):
   
   | tokenizer | Mchars/s | vs OpenNLP today | vs Lucene 10 |
   |---|---|---|---|
   | OpenNLP `SimpleTokenizer` (today) | 282 | 1.00× | 0.88× |
   | Lucene 10.3.2 `StandardTokenizer` | 320 | 1.13× | 1.00× |
   | new — boundary scan | 388 | 1.38× | 1.21× |
   | new — streaming tokenize | 335 | 1.19× | 1.05× |
   
   The boundary output is byte-identical before and after the perf work — 
1944/1944 on the official
   UAX #29 conformance suite throughout; the transition table is *derived from* 
the readable rule
   cascade at class-load, so the rules stay the source of truth. The public 
surface (the `Tokenizer`
   impl, the streaming handler, the `Term` / `Dimension` layering) is 
independent of the engine
   internals, so we can retune or swap the table without touching the 
interface. The Lucene
   head-to-head benchmark stays out of the repo (it'd pull a Lucene dependency 
- 
   I don't think we should but I was just curious), but I'll share the harness 
in a temporary commit.
   
   OK to cut the two stacks?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1850: Improve Whitespace UTF normalization (opennlp)

Reply via email to