krickert opened a new pull request, #1138: URL: https://github.com/apache/opennlp/pull/1138
Part of the OPENNLP-1852 epic (offset-aware normalization, provenance-tagged data, and an emoji annotation layer). Stacks on the OPENNLP-1850 stack, based on OPENNLP-1850-4-docs.

## Summary

Adds Unicode full case folding (UTS #21) as an offset-aware rung, using a bundled CaseFolding.txt table. Unlike the existing locale-based caseFold(), full case folding includes the expanding folds plain lower casing does not perform: sharp s to ss, the Latin ligatures, and the Greek and Armenian multi-character folds. Because the table is authored with known source and target lengths, the rung builds its Alignment during the substitution pass, so it is offset-aware with no ICU4J dependency. This is the offset-aware pipeline's first authored expanding fold beyond the OPENNLP-1850 CharClass rungs, and it proves the buildAligned() contract end to end on real data.

## What's included

- FullCaseFoldCharSequenceNormalizer, an OffsetAwareNormalizer backed by the bundled CaseFolding.txt (C common and F full status rows; S simple and T Turkic are recognized and intentionally skipped).
- TextNormalizer.Builder.fullCaseFold(), composing into buildAligned().
- A FULL_CASE_FOLD Dimension and TermAnalyzer.Builder.fullCaseFold(), so the Term model gets the same fold.
- LICENSE/NOTICE/NOTICE.template attribution for the bundled CaseFolding.txt (Unicode License V3).
- Manual updates to the Text Normalization chapter.
- A completeness audit test against the bundled data (1585 C+F rows), plus round-trip tests covering a supplementary-plane fold, a 1-to-3 code point expansion, consecutive expanding folds in one string, and canonical-order coverage in TermAnalyzer.
- Fails loud on an unrecognized status letter in the data, matching the fail-loud contract already applied to Confusables and WordBreakProperty in OPENNLP-1850.

## Design note

java.text.Normalizer (NFC/NFKC) reports no edits and stays offset-opaque in buildAligned().
This fold is authored data instead, so the Alignment comes for free from the substitution pass, no ICU4J dependency needed.
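
To make the "Alignment comes for free" point concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR and does not use any OpenNLP API: the class `FullCaseFoldSketch`, the record `Folded`, the `parse` and `fold` helpers, and the per-character `int[]` offset map are all hypothetical names chosen for illustration, and the real Alignment type may look quite different. The embedded rows follow the CaseFolding.txt line syntax (`code; status; mapping; # name`) but are only a tiny sample, so code points without a row pass through unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FullCaseFoldSketch {

    // Sample rows in CaseFolding.txt syntax (assumption: a small excerpt, not the bundled table).
    static final String SAMPLE_ROWS = String.join("\n",
        "0053; C; 0073; # LATIN CAPITAL LETTER S",
        "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
        "0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE",
        "1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S",
        "1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S",
        "FB03; F; 0066 0066 0069; # LATIN SMALL LIGATURE FFI",
        "10400; C; 10428; # DESERET CAPITAL LETTER LONG I");

    /** Folded text plus, for every folded char, the source char offset it came from. */
    record Folded(String text, int[] sourceOffsets) {}

    /** Keeps C and F rows, skips S and T, and fails loud on any other status letter. */
    static Map<Integer, String> parse(String data) {
        Map<Integer, String> table = new HashMap<>();
        for (String line : data.split("\n")) {
            int hash = line.indexOf('#');
            String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (body.isEmpty()) {
                continue; // blank or comment-only line
            }
            String[] fields = body.split(";");
            String status = fields[1].trim();
            switch (status) {
                case "C", "F" -> {
                    StringBuilder target = new StringBuilder();
                    for (String hex : fields[2].trim().split(" ")) {
                        target.appendCodePoint(Integer.parseInt(hex, 16));
                    }
                    table.put(Integer.parseInt(fields[0].trim(), 16), target.toString());
                }
                case "S", "T" -> { /* recognized, intentionally skipped */ }
                default -> throw new IllegalStateException("Unrecognized status: " + status);
            }
        }
        return table;
    }

    /** One substitution pass that records the source offset of every emitted char. */
    static Folded fold(CharSequence in, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            int cp = Character.codePointAt(in, i);
            String target = table.get(cp);
            if (target == null) {
                target = new String(Character.toChars(cp)); // no row: identity
            }
            out.append(target);
            for (int k = 0; k < target.length(); k++) {
                offsets.add(i); // every emitted char points back at the source start
            }
            i += Character.charCount(cp);
        }
        return new Folded(out.toString(),
            offsets.stream().mapToInt(Integer::intValue).toArray());
    }

    public static void main(String[] args) {
        Folded f = fold("Stra\u00DFe", parse(SAMPLE_ROWS));
        System.out.println(f.text());                           // strasse
        System.out.println(Arrays.toString(f.sourceOffsets())); // [0, 1, 2, 3, 4, 4, 5]
    }
}
```

Because each table row states its source code point and its full target sequence, the offset map falls out of the same loop that does the substitution. That is the property the design note contrasts with java.text.Normalizer, which returns only the transformed string.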
