Re: [PR] OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) (opennlp)

via GitHub Sun, 21 Jun 2026 10:03:00 -0700


rzo1 commented on PR #1103:
URL: https://github.com/apache/opennlp/pull/1103#issuecomment-4762676928


   Thx for the PR. Here are some suggestions:
   
   - Dimension javadoc forward-references Term/TermAnalyzer, which only arrive 
in #1104. Standalone javadoc on this branch emits unresolved-reference 
warnings. Either downgrade those {@link} to {@code}, or confirm the stack 
always merges as a unit.
   - Offset mapping isn't reachable through the builder. 
normalizeMapped/OffsetMap exist only directly on CharClass, while 
TextNormalizer.build() composes plain CharSequenceNormalizers and discards 
offsets, so the NormalizedText/OffsetMap capability can't be reached via 
TextNormalizer. Is that intentional for now?
   - OffsetMap buffer growth uses Arrays.copyOf(buffer, buffer.length * 2), 
which overflows to NegativeArraySizeException past ~2^30 chars instead of a 
clean OOM. Overflow-aware growth would be tidier. Edge case, per-document use.
   - Confusables.load() has no per-line guard, so a malformed bundled line 
surfaces as ExceptionInInitializerError, whereas the user-facing 
CodePointSet.fromFile reports the offending line number. Minor inconsistency. A 
checksum or version assertion would also back the "unmodified upstream 17.0.0" 
NOTICE claim.
   - Nit: serialVersionUID = 1L on the new rungs vs random longs on the legacy 
ones, and TextNormalizer.builder() returns a TextNormalizer that is itself the 
mutable builder (a nested Builder would read cleaner).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) (opennlp)

Reply via email to