krickert commented on PR #1103:
URL: https://github.com/apache/opennlp/pull/1103#issuecomment-4780105455
@rzo1 Thanks — both of your points on the foundation are addressed.
**Split into 1a + 1b (done).** I split the foundation along the history
exactly where you suggested:
- **#1108 — engine (1a):**
`CharClass`/`CodePointSet`/`UnicodeWhitespace`/`UnicodeDash`, the
per-code-point rungs, `Dimension`, the non-aligned `TextNormalizer`, and
`confusables.txt` with all its `LICENSE`/`NOTICE`/`rat-excludes` bookkeeping.
This is mostly mechanical substitution, and it is where the license review belongs.
- **#1109 — offset/alignment layer (1b):** `Alignment`, `AlignedText`,
`OffsetAwareNormalizer`, `buildAligned()`, the `*Aligned` `CharClass` variants,
and the dense span-mapping tests (binary-search mapping, expansion/deletion
edge cases). The conceptually hard ~800 lines, isolated for a focused read.
`#1104` (tokenizer) is now based on `#1109`, so the stack is **1a → 1b →
tokenizer → DL → docs**, each well under your ~1.5k-real-code target, and the
10k-line `confusables.txt` data file is contained in 1a. I closed `#1103`
pointing at the two replacements.
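
For anyone skimming 1b first, here is a rough illustration of the kind of binary-search span mapping its tests exercise. This is a hypothetical sketch, not the actual `Alignment` code: the class name `OffsetMapSketch`, its three-array layout, and the `toOriginal` method are all assumptions made for illustration. It maps an offset in the normalized text back to the original text, covering one deletion-style run (1:1 mapping with a shifted start) and one expansion-style run (several normalized characters produced from a single original position).

```java
import java.util.Arrays;

// Hypothetical sketch only; names and layout are NOT taken from the PR.
final class OffsetMapSketch {
  // Start offset of each run in the normalized text; strictly ascending, first entry 0.
  private final int[] normStarts;
  // Start offset of the same run in the original text.
  private final int[] origStarts;
  // true: every normalized offset in the run maps to the single original position
  // origStarts[i] (an expansion such as a ligature becoming two letters).
  private final boolean[] expanded;

  OffsetMapSketch(int[] normStarts, int[] origStarts, boolean[] expanded) {
    if (normStarts.length == 0 || normStarts[0] != 0
        || normStarts.length != origStarts.length
        || normStarts.length != expanded.length) {
      throw new IllegalArgumentException("runs must be non-empty, parallel, and start at 0");
    }
    this.normStarts = normStarts.clone();
    this.origStarts = origStarts.clone();
    this.expanded = expanded.clone();
  }

  // Maps an offset in the normalized text to an offset in the original text.
  int toOriginal(int normOffset) {
    if (normOffset < 0) {
      throw new IndexOutOfBoundsException("negative offset: " + normOffset);
    }
    int idx = Arrays.binarySearch(normStarts, normOffset);
    if (idx < 0) {
      // Not an exact run start: binarySearch returns -(insertionPoint) - 1,
      // and the containing run is the one just before the insertion point.
      idx = -idx - 2;
    }
    if (expanded[idx]) {
      return origStarts[idx];
    }
    return origStarts[idx] + (normOffset - normStarts[idx]);
  }
}
```

As a usage example under the same assumptions: for an original two-character string made of the `ﬁ` ligature followed by `x`, normalized to `fix`, the runs would be `normStarts = {0, 2}`, `origStarts = {0, 1}`, `expanded = {true, false}`. Normalized offsets 0 and 1 both map back to original offset 0, and normalized offset 2 maps to original offset 1.
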
**Static-initializer resource loading (done, and generalized).** Agreed on
the rule. All three bundled-data loaders that did classpath I/O in a `static
{}` block now load lazily on first use through a double-checked accessor, so a
resource the loader can't see surfaces as a catchable exception at call time
rather than an `ExceptionInInitializerError` that poisons the class:
- `Confusables` (1a / #1108)
- `WordBreakProperty` and `ExtendedPictographic` (tokenizer / #1104); both
wrap their tables in a small immutable holder loaded via the same
pattern.
The `List.of(...)` static blocks in `UnicodeWhitespace`/`UnicodeDash` are
left as-is (no I/O, no classloader risk), as you noted.
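
To make the lazy-loading pattern concrete, here is a minimal sketch of a double-checked accessor. It is illustrative only and is not the code in the PRs: the class name `LazyTableSketch`, the resource path, and the choice of `IllegalStateException` are assumptions. The point it shows is that a missing resource fails at the call site with an ordinary exception, and a later call simply tries again, instead of the class being left unusable after a failed static initializer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch only; not the actual Confusables/WordBreakProperty loaders.
final class LazyTableSketch {
  // Hypothetical resource path, chosen for illustration.
  private static final String RESOURCE = "/lazy-table-sketch-example.txt";

  // volatile is needed so double-checked locking publishes the value safely.
  private static volatile String table;

  private LazyTableSketch() {
  }

  // Loads the table on first use; no classpath I/O happens at class-init time.
  static String table() {
    String local = table;
    if (local == null) {
      synchronized (LazyTableSketch.class) {
        local = table;
        if (local == null) {
          // If load() throws, the field stays null and the next call retries.
          local = load();
          table = local;
        }
      }
    }
    return local;
  }

  private static String load() {
    try (InputStream in = LazyTableSketch.class.getResourceAsStream(RESOURCE)) {
      if (in == null) {
        // Surfaces as a normal, catchable exception at call time.
        throw new IllegalStateException("Resource not found on classpath: " + RESOURCE);
      }
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With a `static {}` block, the same missing resource would raise `ExceptionInInitializerError` on first access and `NoClassDefFoundError` on every access after that; with the accessor above, each call reports the missing resource directly.
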
Each layer builds and tests green on its own (`mvn -pl … -am verify`, plus
checkstyle + forbiddenapis across the full reactor).