Re: [PR] OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) (opennlp)

via GitHub Tue, 23 Jun 2026 10:10:19 -0700


krickert commented on PR #1104:
URL: https://github.com/apache/opennlp/pull/1104#issuecomment-4781634866


   @rzo1 Both points on the tokenizer PR are addressed.
   
   **Resource loading (already done).** `WordBreakProperty` and 
`ExtendedPictographic` no longer load in a `static {}` block. Both now load 
lazily on first use through a double-checked accessor, with their tables held 
in a small immutable holder. A resource that the class's loader can't see 
therefore surfaces as a catchable exception at call time rather than an 
`ExceptionInInitializerError` that poisons the class (and, as you noted, would 
otherwise take the whole `WordSegmenter` / `WordTokenizer` down). This shipped 
in the tokenizer before the split and rode into **#1110**. The 
`getResourceAsStream` in `WordBoundaryConformanceTest` is left as-is 
(test-only).
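   
   For readers following along, here is a minimal sketch of the lazy, 
double-checked loading pattern described above. It is an illustration under 
assumptions, not the code in #1110: `LazyTable`, `Holder`, and the resource 
path are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch; not the actual WordBreakProperty implementation. */
final class LazyTable {

  /** Immutable holder: once published, the table never changes. */
  private static final class Holder {
    final List<String> lines;

    Holder(List<String> lines) {
      this.lines = List.copyOf(lines);
    }
  }

  // Hypothetical resource path, for illustration only.
  static final String RESOURCE = "/hypothetical/WordBreakProperty.txt";

  // volatile is required for double-checked locking to publish safely.
  private static volatile Holder holder;

  private LazyTable() {
  }

  /** Loads the table on first use; a failure is thrown to the caller. */
  static List<String> lines() {
    Holder h = holder;
    if (h == null) {
      synchronized (LazyTable.class) {
        h = holder;
        if (h == null) {
          // If load() throws, holder stays null, so a later call can retry.
          h = load();
          holder = h;
        }
      }
    }
    return h.lines;
  }

  private static Holder load() {
    try (InputStream in = LazyTable.class.getResourceAsStream(RESOURCE)) {
      if (in == null) {
        // Ordinary runtime exception at call time, not a failed class init.
        throw new IllegalStateException("Resource not visible: " + RESOURCE);
      }
      String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
      return new Holder(List.of(text.split("\\R")));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

   Because nothing here runs in a static initializer, a missing resource 
fails only the call that needed it. The class itself stays usable, and the 
caller can catch the exception and decide what to do.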
   
   **Split into 2a / 2b / 2c (done).** Split the tokenizer PR along the three 
concepts you identified:
   
   - **#1110, UAX #29 tokenizer (2a):** `WordSegmenter`, `WordTokenizer`, 
`WordType`, `WordToken`, `WordBreak`, `WordBreakProperty`, 
`ExtendedPictographic`, the bundled `Word_Break` / `Extended_Pictographic` data 
plus the `WordBreakTest.txt` conformance suite, and the data 
`LICENSE`/`NOTICE`/`rat-excludes`. Self-contained, and the bulk of the work.
   - **#1111, Term model (2b):** `Term`, `TermAnalyzer`, the `Dimension` 
`{@link}` restore, and their tests. Based on 2a (`TermAnalyzer` segments with 
the UAX #29 tokenizer from #1110).
   - **#1112, NormalizationProfile registry (2c):** `NormalizationProfile`, 
`NormalizationProfiles`, and tests. Based on 2b.
   
   `#1105` (DL) now bases on `#1112`; I closed `#1104` pointing at the three. 
The full stack is now **1a → 1b → 2a → 2b → 2c → DL → docs**, each conceptually 
scoped and well under the ~1.5k-real-code mark, with the ~4k lines of bundled 
UCD data contained in 2a.
   
   Each layer builds and tests green on its own (`mvn -pl … -am verify`; the 
full reactor compiles and passes checkstyle/forbiddenapis).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
