krickert commented on code in PR #1105:
URL: https://github.com/apache/opennlp/pull/1105#discussion_r3447259087
##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -327,6 +329,63 @@ protected static void validateSplitOptions(final int documentSplitSize, final in
     }
   }
+  /**
+   * Unicode-aware whitespace. Input is tokenized on the full Unicode {@code White_Space} set
+   * rather than the six ASCII characters Java's {@code \s} recognizes, and the same class is
+   * reused by subclasses that need to match against whitespace in the source text.
+   */
+  protected static final CharClass WHITESPACE = CharClass.whitespace();
+
+  /** Unicode dashes (excluding the mathematical minus signs), used for optional input folding. */
+  protected static final CharClass DASHES = CharClass.dashes();
+
+  /**
+   * Optionally folds Unicode whitespace and/or dashes in the input to their ASCII forms before
+   * inference. Each member code point maps to exactly one ASCII character, so the transform is
+   * offset preserving for Basic Multilingual Plane characters and any spans a model produces still
+   * align with the input.
+   *
+   * @param text The input text.
+   * @param normalizeWhitespace Whether to fold whitespace to ASCII spaces.
+   * @param normalizeDashes Whether to fold dashes to the ASCII hyphen.
+   * @return The optionally normalized text.
+   */
+  protected static String normalizeInput(final String text, final boolean normalizeWhitespace,
+      final boolean normalizeDashes) {
+    String result = text;
+    if (normalizeWhitespace) {
+      result = WHITESPACE.normalize(result).toString();
+    }
+    if (normalizeDashes) {
+      result = DASHES.normalize(result).toString();
+    }
+    return result;
+  }
Review Comment:
Fixed with an offset map rather than by restricting the fold.
`AbstractDL.normalizeInputMapped` folds the input (all dashes included) and
returns an `OffsetMap` back to the original text, so positions stay correct
whether the fold shrinks or expands the text. `NameFinderDL` adds
`findInOriginal(String[])`, which maps decoded spans back to original-input
coordinates through that map; the existing `find(String[])` is preserved but
deprecated. Model-free tests cover the supplementary-dash shrink case and the
length-preserving identity case. `DocumentCategorizerDL` is unaffected because
it returns category scores, not positions.
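
For readers following along, here is a minimal sketch of the offset-map idea, not the code in this PR. `FoldWithOffsets`, `Folded`, and the small dash subset in `isDash` are hypothetical stand-ins for `normalizeInputMapped`, `OffsetMap`, and `CharClass.dashes()`, whose actual signatures are not shown in this thread. The sketch only handles the shrink and identity cases (a surrogate pair or a BMP dash folding to one ASCII hyphen); it does not attempt expansion.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and dash set are assumptions, not the PR's API.
final class FoldWithOffsets {

  /** Folded text plus a table from folded char index to original char index. */
  static final class Folded {
    final String text;
    final int[] toOriginal; // length == text.length() + 1; last entry maps the end offset

    Folded(String text, int[] toOriginal) {
      this.text = text;
      this.toOriginal = toOriginal;
    }

    /** Maps a half-open [start, end) span in the folded text to original coordinates. */
    int[] originalSpan(int start, int end) {
      return new int[] {toOriginal[start], toOriginal[end]};
    }
  }

  /** Assumed subset of dashes: U+2010..U+2015 (BMP) and U+10EAD (supplementary). */
  static boolean isDash(int cp) {
    return (cp >= 0x2010 && cp <= 0x2015) || cp == 0x10EAD;
  }

  static Folded foldDashes(String input) {
    StringBuilder out = new StringBuilder(input.length());
    // The fold never grows the text here, so input.length() + 1 slots are enough.
    int[] map = new int[input.length() + 1];
    int i = 0;
    while (i < input.length()) {
      int cp = input.codePointAt(i);
      int width = Character.charCount(cp); // 2 for a surrogate pair, else 1
      if (isDash(cp)) {
        // One output char stands for 'width' input chars.
        map[out.length()] = i;
        out.append('-');
      } else {
        // Copy the code point unchanged, one UTF-16 unit at a time.
        for (int k = 0; k < width; k++) {
          map[out.length()] = i + k;
          out.append(input.charAt(i + k));
        }
      }
      i += width;
    }
    map[out.length()] = input.length(); // end-of-text offset
    return new Folded(out.toString(), Arrays.copyOf(map, out.length() + 1));
  }
}
```

With the input `"x"` + U+10EAD + `"Paris"`, the folded text is `x-Paris`, and the folded span `[2, 7)` maps back to `[3, 8)` in the original, which is still `Paris`. Keeping an explicit end-offset entry lets half-open spans that end at the end of the text map without a special case.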