krickert commented on PR #1101:
URL: https://github.com/apache/opennlp/pull/1101#issuecomment-4755649750
Thanks for the thorough pass.
Below I touch on every point.
It's all pushed now (through `5ab7f873`); and did a first pass. If you are
OK with it - I can do 2 smaller PRs on top of this (I think that's easiest -
but if you want it to be further broken down I can figure it out).
## Licensing / release plumbing - done
- **NOTICE.template, not just NOTICE.** The Unicode attribution now lives in
`src/license/NOTICE.template`, so it survives regeneration (same as the
stopword /
spellchecker entries). The generated `NOTICE` is updated to match.
- **rat-excludes.** Added the four bundled `.txt` paths under an
`OPENNLP-1850` comment
(`WordBreakProperty.txt`, `ExtendedPictographic.txt`, `confusables.txt`,
and the
`WordBreakTest.txt` test fixture).
- **LICENSE carries the actual text.** Embedded the full Unicode License V3
in `LICENSE`,
the way the stopword lists embed their BSD text - since the newer Unicode
headers only
link to `terms_of_use.html` rather than inlining it, a URL in NOTICE
wasn't enough on
its own.
- **ExtendedPictographic.txt wording.** Corrected - it's described as
derived from
`emoji-data.txt` by keeping only the `Extended_Pictographic` property (451
lines,
renamed), not an unmodified copy. The wording also notes the upstream file
carries the
other five emoji properties that aren't retained.
## Architecture
**One token-analysis entry point, one place that defines the steps - done.**
This wasn't bad at all, it also shrinks the surface:
- `TextAnalyzer` / `AnalyzedToken` are deleted. `TermAnalyzer` / `Term` is
the single
entry point - `Term.original()` / `normalized()` / `span()` give you
exactly what
`AnalyzedToken` did, on the UAX #29 tokenizer instead of a second one.
- Each `Dimension` now carries its own default `CharSequenceNormalizer`
(lazily, so the
confusables table isn't loaded just by touching the enum). `TermAnalyzer`
dropped its
parallel `defaultTransforms()` map, and `TextNormalizer`'s nfc / nfkc /
whitespace /
dash / case-fold / accent-fold rungs now read from `Dimension`. The six
shared rungs are
defined once. (`TextNormalizer`'s standalone cleanup steps - quotes,
digits, ellipsis,
bullets, strip-invisible - I left as-is for now since they already live in
one place;
promoting them to `Dimension`s is an easy follow-up if you'd rather
everything be
layer-able. LMK)
**api vs runtime split is completed.** `opennlp-api/util/normalizer` is now
exactly three files:
`CharSequenceNormalizer` (the contract) plus the table-free value types
`NormalizedText`
and `OffsetMap`. Everything with data or an engine - `CharClass`,
`CodePointSet`,
`UnicodeWhitespace`, `UnicodeDash`, the whole normalizer family - moved down
to
`opennlp-runtime`. Nothing table-shaped is left in the API. One consequence
to sanity-check:
`AbstractDL` uses `CharClass` for input chunking, so `opennlp-dl` now
compile-depends on
`opennlp-runtime` (it was test-scope before). That's the minimal fix; if
you'd rather DL
stay api-thin, the alternative is to inject the normalizer through
`InferenceOptions` just LMK.
I don't have an opin in either direction for that one.
**DL changes are behavioral.** The `InferenceOptions` flags,
`normalizeInput`, and
the span-decode rework are behavioral changes to existing components, not
purely additive, so
I'll go ahead and make that a separate review. There were good reasons for
this: feeding offset-shifting normalization into the DL models requires the
span decode to map predictions back to the original text rather than the
normalized buffer. Therefore without that rework,
`NameFinderDL`/`DocumentCategorizerDL` would emit spans pointing at the wrong
characters whenever normalization changed the input length. However, the
output remains unchanged
## Reviewability splitting it up
Rather than stack onto the in-flight PRs (this is self-contained and would
just tangle with those), I'd split *this* along its natural dependency seam
into two stacks:
- **Stack 1 - normalization foundation:** the API contracts + the runtime
normalizer engine
(`CharClass`, the normalizer family, `Dimension`, `TextNormalizer`, the
offset model) + the
DL opt-ins.
- **Stack 2 - UAX #29 tokenizer + advanced:** the tokenizer and its
boundary/perf engine (where
the lookup tables live), the layered `Term` model, confusable folding (UTS
#39), and the
per-language profiles
That contains the perf/lookup-table and DL-behavior conversations entirely
in Stack 2 and lets
the foundation land clean. Your six-way split would also work; I leaned to
two because the seam
between foundation and tokenizer is the one real dependency boundary, and
the rest are more
"commits within a stack" than independently-landable units.
## The speed framing (for context, not a competition)
Numbers from JMH harness (2 forks, warmed up), vs **Lucene 10.3.2**
`StandardTokenizer` and today's `SimpleTokenizer`, Latin corpus (Mchars/s):
| tokenizer | Mchars/s | vs OpenNLP today | vs Lucene 10 |
|---|---|---|---|
| OpenNLP `SimpleTokenizer` (today) | 282 | 1.00× | 0.88× |
| Lucene 10.3.2 `StandardTokenizer` | 320 | 1.13× | 1.00× |
| new — boundary scan | 388 | 1.38× | 1.21× |
| new — streaming tokenize | 335 | 1.19× | 1.05× |
The boundary output is byte-identical before and after the perf work —
1944/1944 on the official
UAX #29 conformance suite throughout; the transition table is *derived from*
the readable rule
cascade at class-load, so the rules stay the source of truth. The public
surface (the `Tokenizer`
impl, the streaming handler, the `Term` / `Dimension` layering) is
independent of the engine
internals, so we can retune or swap the table without touching the
interface. The Lucene
head-to-head benchmark stays out of the repo (it'd pull a Lucene dependency
-
I don't think we should but I was just curious), but I'll share the harness
in a temporary commit.
OK to cut the two stacks?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]