Re: [PR] OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7) (opennlp)

via GitHub Thu, 25 Jun 2026 04:20:48 -0700


krickert commented on PR #1110:
URL: https://github.com/apache/opennlp/pull/1110#issuecomment-4798666813


   @rzo1 Both points on the tokenizer PR are addressed.
   
   **Resource loading (done).** `WordBreakProperty` and `ExtendedPictographic` 
no longer load their data in a `static {}` block. Each now loads lazily on first 
use through a double-checked accessor, so a resource the class loader can't see 
surfaces as a catchable exception at call time. Previously it surfaced as an 
`ExceptionInInitializerError`, which poisons the class and would take the whole 
`WordSegmenter`/`WordTokenizer` down with it. The `getResourceAsStream` call in 
`WordBoundaryConformanceTest` is left as-is because it is test-only.
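   
   For illustration, a minimal sketch of the lazy, double-checked pattern 
described above. This is not the code in the PR: `LazyResource`, `onClasspath`, 
and the `Supplier`-based opener are hypothetical names chosen to keep the 
example self-contained. It only shows why a missing resource becomes an 
ordinary exception that the caller can catch, while the class itself stays 
usable.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

/** Hypothetical holder for illustration; not an OpenNLP class. */
final class LazyResource {

  private final String name;
  private final Supplier<InputStream> opener;

  // volatile is required for double-checked locking to be safe.
  private volatile byte[] bytes;

  LazyResource(String name, Supplier<InputStream> opener) {
    this.name = name;
    this.opener = opener;
  }

  /** Assumed convenience factory: resolves the name on the system classpath. */
  static LazyResource onClasspath(String name) {
    return new LazyResource(name, () -> ClassLoader.getSystemResourceAsStream(name));
  }

  /** Loads on first use; a failure is thrown to the caller, not at class init. */
  byte[] get() {
    byte[] local = bytes;              // first check, without the lock
    if (local == null) {
      synchronized (this) {
        local = bytes;                 // second check, under the lock
        if (local == null) {
          local = load();              // may throw; the field stays null, so a later call retries
          bytes = local;
        }
      }
    }
    return local;
  }

  private byte[] load() {
    try (InputStream in = opener.get()) {
      if (in == null) {
        throw new IllegalStateException("Resource not found: " + name);
      }
      return in.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to read " + name, e);
    }
  }
}
```

   Because nothing runs in a static initializer, a failed load leaves the class 
initialized and a later call can retry. The trade-off is that the error appears 
at the first call site rather than at class-load time.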
   
   **Split into 2a / 2b / 2c (done).** The split follows the three concepts you 
identified:
   - **#1110** (2a): UAX #29 tokenizer 
(`WordSegmenter`/`WordTokenizer`/`WordType`/`WordToken`/`WordBreak`/`WordBreakProperty`/`ExtendedPictographic`, 
plus bundled data and the conformance suite). Self-contained, and the bulk of 
the work.
   - **#1111** (2b): Term model (`Term`/`TermAnalyzer`), stacked on 2a.
   - **#1112** (2c): `NormalizationProfile` registry, stacked on 2b.
   
   `#1105` (DL) is now based on `#1112`, and I closed `#1104` with a pointer 
here. Each layer builds and passes its tests on its own.
   


