rzo1 commented on PR #1104: URL: https://github.com/apache/opennlp/pull/1104#issuecomment-4762677893
Thx for the PR. Here are some suggestions: - Term.at() ordering footgun. For an unconfigured dimension, at() applies it on top of normalized() (the final configured form) rather than in canonical pipeline order. Given the non-commutativity your own Dimension javadoc warns about (case- vs accent-fold, Turkish i, German ss), requesting a dimension that sits earlier than configured ones yields a linguistically different result. Either restrict at() to configured plus more-aggressive dimensions, or document the constraint with an example. - WordType.IDEOGRAPHIC javadoc says "A single Han ideograph", but WordType.of returns it for any token containing a Han code point. True in practice under UAX #29, but the doc overstates the guarantee. - WordTokenizer deliberately doesn't extend AbstractTokenizer and implements Tokenizer directly. Reasonable, but a one-line comment noting the deliberate divergence would help future readers. - Nit: IntList capacity doubling (values.length * 2) overflows for more than ~2^30 boundaries. Purely theoretical, and it matches patterns elsewhere. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
