krickert opened a new pull request, #1138:
URL: https://github.com/apache/opennlp/pull/1138

   Part of the OPENNLP-1852 epic (offset-aware normalization, provenance-tagged 
data, and an emoji annotation layer). Stacks on the OPENNLP-1850 stack, based 
on OPENNLP-1850-4-docs.
   
   ## Summary
   
   Adds Unicode full case folding (UTS #21) as an offset-aware rung, using a 
bundled CaseFolding.txt table. Unlike the existing locale-based caseFold(), 
full case folding includes the expanding folds that plain lowercasing does not 
perform: sharp s (U+00DF) to ss, the Latin ligatures, and the Greek and 
Armenian multi-character folds.
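
   As a rough illustration of the difference (not the code in this PR), the 
sketch below contrasts String.toLowerCase(Locale.ROOT) with a tiny hand-copied 
excerpt of F rows from CaseFolding.txt. The class name FullFoldVsLowercase, the 
four-entry map, and the use of Character.toLowerCase() as a stand-in for the C 
rows are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;

public class FullFoldVsLowercase {

    // Hypothetical excerpt: four F (full) rows copied from CaseFolding.txt.
    // The real table is far larger; this map exists only for the demo.
    static final Map<Integer, String> FULL_FOLDS = Map.of(
            0x00DF, "ss",                      // LATIN SMALL LETTER SHARP S
            0xFB01, "fi",                      // LATIN SMALL LIGATURE FI
            0x0390, "\u03B9\u0308\u0301",      // Greek iota with dialytika and tonos
            0x0587, "\u0565\u0582");           // ARMENIAN SMALL LIGATURE ECH YIWN

    // Folds each code point: F rows from the excerpt, otherwise simple lowercasing.
    // Assumption: Character.toLowerCase() only approximates the C rows.
    static String fullFold(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String folded = FULL_FOLDS.get(cp);
            if (folded != null) {
                out.append(folded);
            } else {
                out.appendCodePoint(Character.toLowerCase(cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String word = "Stra\u00DFe";
        // Lowercasing leaves U+00DF untouched, so the length stays at 6.
        System.out.println(word.toLowerCase(Locale.ROOT).length());
        // Full folding expands U+00DF to "ss".
        System.out.println(fullFold(word)); // prints "strasse"
    }
}
```

   The point of the comparison is that the output can be longer than the input, 
which is what makes offset tracking necessary in the first place.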
   
   Because the table is authored with known source and target lengths, the rung 
builds its Alignment during the substitution pass, so it is offset-aware with 
no ICU4J dependency. This is the offset-aware pipeline's first authored 
expanding fold beyond the OPENNLP-1850 CharClass rungs, and it proves the 
buildAligned() contract end to end on real data.
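
   A minimal sketch of how a substitution pass can record offsets as it goes, 
assuming a per-output-character source index as the alignment representation. 
The Aligned class, its sourceStart field, and the two-entry FOLDS map are 
hypothetical and are not the Alignment type used by buildAligned().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class AlignedFoldSketch {

    // Hypothetical two-row excerpt of expanding folds (both are real F rows).
    static final Map<Integer, String> FOLDS = Map.of(
            0x00DF, "ss",    // 1 char  -> 2 chars
            0xFB03, "ffi");  // 1 char  -> 3 chars

    // Hypothetical result type: folded text plus, for every output char,
    // the index in the input where its source code point starts.
    static final class Aligned {
        final String text;
        final int[] sourceStart;

        Aligned(String text, int[] sourceStart) {
            this.text = text;
            this.sourceStart = sourceStart;
        }
    }

    static Aligned fold(String input) {
        StringBuilder out = new StringBuilder();
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            String replacement = FOLDS.get(cp);
            if (replacement == null) {
                replacement = new String(Character.toChars(cp)); // unchanged
            }
            // Source and target lengths are both known here, so the offset
            // mapping is written in the same pass as the substitution.
            for (int k = 0; k < replacement.length(); k++) {
                starts.add(i);
            }
            out.append(replacement);
            i += Character.charCount(cp);
        }
        int[] sourceStart = starts.stream().mapToInt(Integer::intValue).toArray();
        return new Aligned(out.toString(), sourceStart);
    }

    public static void main(String[] args) {
        Aligned a = fold("a\u00DFb");
        System.out.println(a.text); // prints "assb"
        System.out.println(java.util.Arrays.toString(a.sourceStart)); // prints "[0, 1, 1, 2]"
    }
}
```

   Both output characters of an expansion point back at the same input index, 
so a span found in the folded text can be mapped back to the original.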
   
   ## What's included
   
   - FullCaseFoldCharSequenceNormalizer, an OffsetAwareNormalizer backed by the 
bundled CaseFolding.txt. It loads the C (common) and F (full) status rows; the 
S (simple) and T (Turkic) rows are recognized and intentionally skipped.
   - TextNormalizer.Builder.fullCaseFold(), composing into buildAligned().
   - A FULL_CASE_FOLD Dimension and TermAnalyzer.Builder.fullCaseFold(), so the 
Term model gets the same fold.
   - LICENSE/NOTICE/NOTICE.template attribution for the bundled CaseFolding.txt 
(Unicode License V3).
   - Manual updates to the Text Normalization chapter.
   - A completeness audit test against the bundled data (1585 C+F rows), plus 
round-trip tests covering a supplementary-plane fold, a 1-to-3 code point 
expansion, consecutive expanding folds in one string, and canonical-order 
coverage in TermAnalyzer.
   - Fail-loud handling of an unrecognized status letter in the data, matching 
the fail-loud contract already applied to Confusables and WordBreakProperty in 
OPENNLP-1850.
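
   The row handling described in the list above can be sketched roughly as 
follows. This is an illustration under assumptions, not the loader in this PR: 
the class name CaseFoldingParseSketch, the parse() signature, and the choice of 
IllegalArgumentException are hypothetical. The row layout (code; status; 
mapping; # name) is the one used by CaseFolding.txt.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CaseFoldingParseSketch {

    // Builds a code point -> folded string table from CaseFolding.txt lines.
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> table = new HashMap<>();
        for (String raw : lines) {
            // Drop the trailing "# name" comment, then skip empty lines.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split(";");
            if (fields.length < 3) {
                throw new IllegalArgumentException("Malformed row: " + raw);
            }
            int source = Integer.parseInt(fields[0].trim(), 16);
            String status = fields[1].trim();
            switch (status) {
                case "C":
                case "F": {
                    // Keep common and full folds; the mapping may be several code points.
                    StringBuilder target = new StringBuilder();
                    for (String hex : fields[2].trim().split("\\s+")) {
                        target.appendCodePoint(Integer.parseInt(hex, 16));
                    }
                    table.put(source, target.toString());
                    break;
                }
                case "S":
                case "T":
                    // Recognized, intentionally skipped.
                    break;
                default:
                    // Fail loud rather than silently dropping unknown data.
                    throw new IllegalArgumentException(
                            "Unrecognized status '" + status + "' in row: " + raw);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = parse(List.of(
                "# sample rows",
                "0041; C; 0061; # LATIN CAPITAL LETTER A",
                "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
                "1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S"));
        System.out.println(table.get(0x00DF)); // prints "ss"
    }
}
```

   Throwing on an unknown status means a future revision of the data file that 
adds a new row type is noticed at load time instead of producing a quietly 
incomplete table.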
   
   ## Design note
   
   java.text.Normalizer (NFC/NFKC) reports no edits and stays offset-opaque in 
buildAligned(). This fold is authored data instead, so the Alignment comes for 
free from the substitution pass, with no ICU4J dependency needed.
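
   To make the offset-opaque point concrete, here is a small sketch using only 
the JDK. The class name NormalizerOpaqueSketch and the nfkc() helper are 
hypothetical; Normalizer.normalize() itself is the standard java.text API, and 
it hands back only the resulting string.

```java
import java.text.Normalizer;

public class NormalizerOpaqueSketch {

    // Thin wrapper: the JDK call returns a String and nothing else.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String in = "\uFB01x";      // LATIN SMALL LIGATURE FI followed by 'x'
        String out = nfkc(in);      // "fix": one char became two
        // The length changed, but the API reports no edit positions, so the
        // caller cannot tell which output chars came from which input char.
        System.out.println(in.length() + " -> " + out.length()); // prints "2 -> 3"
    }
}
```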

