airborne12 commented on code in PR #64794:
URL: https://github.com/apache/doris/pull/64794#discussion_r3481073394
##########
fe/fe-core/src/main/java/org/apache/doris/analysis/InvertedIndexUtil.java:
##########
@@ -134,15 +134,43 @@ public static void checkInvertedIndexParser(String indexColName, PrimitiveType c
}
}
- private static boolean isSingleByte(String str) {
+ private static boolean isAscii(String str) {
for (int i = 0; i < str.length(); i++) {
- if (str.charAt(i) > 0xFF) {
+ if (str.charAt(i) > 0x7F) {
return false;
}
}
return true;
}
+ public static void checkCharFilterProperties(Map<String, String> properties) throws AnalysisException {
Review Comment:
Blocking: this helper tightens validation for the legacy `char_filter_*`
properties and for the `TOKENIZE()` path that reuses them, but the
index-policy/custom-analyzer path still validates independently through
`CharReplaceCharFilterValidator`. That validator currently accepts
`replacement.length() == 1 && replacement.charAt(0) <= 255`, so
`replacement="é"` (U+00E9, one Java `char` with value 233) passes FE validation.
The BE `CharReplaceCharFilterFactory` checks the UTF-8 byte length instead, and
"é" encodes to two bytes, so it rejects the same value via
`_replacement.size() != 1`. Custom analyzers can therefore still create the
FE/BE mismatch this PR is trying to eliminate. Please make the policy validator
reuse the same validation logic, mapping the policy's `pattern`/`replacement` to
these fields, and add a policy test for a non-ASCII replacement.
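
To make the mismatch concrete, here is a minimal standalone sketch. It is not Doris code: the class and method names are hypothetical, and each method only mirrors a rule as described in this comment (the `<= 255` check attributed to the policy validator, the single-UTF-8-byte check attributed to the BE, and the `0x7F` threshold from `isAscii` in the diff).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Doris.
class ReplacementCheckSketch {

    // Mirrors the condition quoted for the policy validator:
    // one Java char whose value is at most 255.
    static boolean passesLatin1Check(String replacement) {
        return replacement.length() == 1 && replacement.charAt(0) <= 255;
    }

    // Mirrors the BE rule as described: the value must be exactly one byte in UTF-8.
    static boolean passesUtf8SingleByteCheck(String replacement) {
        return replacement.getBytes(StandardCharsets.UTF_8).length == 1;
    }

    // ASCII-only rule using the same 0x7F threshold as isAscii in the diff.
    static boolean passesAsciiCheck(String replacement) {
        return replacement.length() == 1 && replacement.charAt(0) <= 0x7F;
    }

    public static void main(String[] args) {
        // The last sample is U+00E9, written as an escape to avoid source-encoding issues.
        String[] samples = {" ", "_", "\u00e9"};
        for (String s : samples) {
            System.out.printf("value=[%s] latin1=%b utf8SingleByte=%b ascii=%b%n",
                    s,
                    passesLatin1Check(s),
                    passesUtf8SingleByteCheck(s),
                    passesAsciiCheck(s));
        }
    }
}
```

Under these assumptions, only the `<= 255` rule disagrees with the byte-length rule for U+00E9, while the ASCII rule and the byte-length rule agree on every single-character input. That is why routing the policy path through the same ASCII check would close the gap.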
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]