Copilot commented on code in PR #1105:
URL: https://github.com/apache/opennlp/pull/1105#discussion_r3446741571
##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -327,6 +329,63 @@ protected static void validateSplitOptions(final int documentSplitSize, final in
}
}
+ /**
+ * Unicode-aware whitespace. Input is tokenized on the full Unicode {@code White_Space} set
+ * rather than the six ASCII characters Java's {@code \s} recognizes, and the same class is
+ * reused by subclasses that need to match against whitespace in the source text.
+ */
+ protected static final CharClass WHITESPACE = CharClass.whitespace();
+
+ /** Unicode dashes (excluding the mathematical minus signs), used for optional input folding. */
+ protected static final CharClass DASHES = CharClass.dashes();
+
+ /**
+ * Optionally folds Unicode whitespace and/or dashes in the input to their ASCII forms before
+ * inference. Each member code point maps to exactly one ASCII character, so the transform is
+ * offset preserving for Basic Multilingual Plane characters and any spans a model produces still
+ * align with the input.
+ *
+ * @param text The input text.
+ * @param normalizeWhitespace Whether to fold whitespace to ASCII spaces.
+ * @param normalizeDashes Whether to fold dashes to the ASCII hyphen.
+ * @return The optionally normalized text.
+ */
+ protected static String normalizeInput(final String text, final boolean normalizeWhitespace,
+ final boolean normalizeDashes) {
+ String result = text;
+ if (normalizeWhitespace) {
+ result = WHITESPACE.normalize(result).toString();
+ }
+ if (normalizeDashes) {
+ result = DASHES.normalize(result).toString();
+ }
+ return result;
+ }
Review Comment:
`CharClass.dashes()` includes supplementary-plane dash code points (e.g.,
U+10D6E). Each of these occupies two UTF-16 code units (a surrogate pair), so
when `DASHES.normalize(...)` replaces one with a single ASCII hyphen, the
string shrinks by one `char` and every offset after that point shifts. That
breaks the offset-preserving guarantee stated in the Javadoc: sentence offsets
and spans can become misaligned with the original input. Consider leaving
supplementary dashes unchanged (or otherwise preserving UTF-16 length) when
`normalizeDashes` is enabled.
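
To make the length change concrete, here is a minimal, self-contained sketch. It does not use the real `CharClass` API: `isDash` is a hypothetical stand-in that covers only a few BMP dashes plus U+10D6E, and `foldAll` / `foldBmpOnly` are illustrative names, not methods from this PR. It only shows one possible way to keep the UTF-16 length stable, by skipping dashes that need a surrogate pair.

```java
public final class DashFoldSketch {

  // Hypothetical stand-in for CharClass.dashes(): U+2010..U+2015 plus one
  // supplementary-plane dash (U+10D6E). The real set is assumed to be larger.
  static boolean isDash(final int codePoint) {
    return (codePoint >= 0x2010 && codePoint <= 0x2015) || codePoint == 0x10D6E;
  }

  // Naive folding: every dash code point becomes a single '-'.
  // A supplementary dash (two chars) turns into one char, so the string shrinks.
  static String foldAll(final String text) {
    final StringBuilder sb = new StringBuilder(text.length());
    text.codePoints().forEach(cp -> {
      if (isDash(cp)) {
        sb.append('-');
      } else {
        sb.appendCodePoint(cp);
      }
    });
    return sb.toString();
  }

  // Length-preserving variant: fold only dashes that occupy one UTF-16 code
  // unit and copy supplementary dashes through unchanged.
  static String foldBmpOnly(final String text) {
    final StringBuilder sb = new StringBuilder(text.length());
    text.codePoints().forEach(cp -> {
      if (isDash(cp) && Character.charCount(cp) == 1) {
        sb.append('-');
      } else {
        sb.appendCodePoint(cp);
      }
    });
    return sb.toString();
  }

  public static void main(final String[] args) {
    // 'a', EN DASH (U+2013), 'b', U+10D6E (surrogate pair), 'c'
    final String input = "a\u2013b" + new String(Character.toChars(0x10D6E)) + "c";
    System.out.println(input.length());              // 6
    System.out.println(foldAll(input).length());     // 5
    System.out.println(foldBmpOnly(input).length()); // 6
  }
}
```

With `foldAll`, the offset of `c` moves from 5 to 4, so a span computed on the folded text no longer indexes the same character in the original input. `foldBmpOnly` keeps every offset stable at the cost of leaving the supplementary dash unfolded.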