Re: [PR] OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) (opennlp)

via GitHub Sun, 21 Jun 2026 10:03:29 -0700


rzo1 commented on PR #1104:
URL: https://github.com/apache/opennlp/pull/1104#issuecomment-4762677893


   Thx for the PR. Here are some suggestions:
   
   - Term.at() ordering footgun. For an unconfigured dimension, at() applies it 
on top of normalized() (the final configured form) rather than in canonical 
pipeline order. Given the non-commutativity your own Dimension javadoc warns 
about (case- vs accent-fold, Turkish i, German ss), requesting a dimension that 
sits earlier than configured ones yields a linguistically different result. 
Either restrict at() to configured plus more-aggressive dimensions, or document 
the constraint with an example.
   - WordType.IDEOGRAPHIC javadoc says "A single Han ideograph", but 
WordType.of returns it for any token containing a Han code point. True in 
practice under UAX #29, but the doc overstates the guarantee.
   - WordTokenizer deliberately doesn't extend AbstractTokenizer and implements 
Tokenizer directly. Reasonable, but a one-line comment noting the deliberate 
divergence would help future readers.
   - Nit: IntList capacity doubling (values.length * 2) overflows for more than 
~2^30 boundaries. Purely theoretical, and it matches patterns elsewhere.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) (opennlp)

Reply via email to