github-actions[bot] commented on code in PR #64794:
URL: https://github.com/apache/doris/pull/64794#discussion_r3478900069


##########
fe/fe-core/src/main/java/org/apache/doris/analysis/InvertedIndexUtil.java:
##########
@@ -143,6 +143,34 @@ private static boolean isSingleByte(String str) {
         return true;
     }
 
+    public static void checkCharFilterProperties(Map<String, String> properties) throws AnalysisException {
+        String charFilterType = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_TYPE);
+        if (charFilterType == null) {
+            return;
+        }
+
+        String charFilterPattern = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_PATTERN);
+        String charFilterReplacement = properties.get(INVERTED_INDEX_PARSER_CHAR_FILTER_REPLACEMENT);
+        if (!INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE.equals(charFilterType)) {
+            throw new AnalysisException("Invalid 'char_filter_type', only '"
+                    + INVERTED_INDEX_CHAR_FILTER_CHAR_REPLACE + "' is supported");
+        }
+        if (charFilterPattern == null || charFilterPattern.isEmpty()) {
+            throw new AnalysisException("Missing 'char_filter_pattern' for 'char_replace' filter type");
+        }
+        if (!isSingleByte(charFilterPattern)) {
+            throw new AnalysisException("'char_filter_pattern' must contain only ASCII characters");
+        }
+        if (charFilterReplacement != null) {
+            if (charFilterReplacement.isEmpty() || charFilterReplacement.length() != 1) {

Review Comment:
   This still lets a replacement like `é` through, even though BE consumes the
   replacement as bytes. `String.length()` counts UTF-16 code units, and
   `isSingleByte()` accepts any char `<= 0xFF`, so `é` (one code unit, U+00E9)
   passes here. In the BE `TOKENIZE` path the property is held as UTF-8 bytes in
   a `std::string`, where `é` is the two-byte sequence `0xC3 0xA9`.
   `CharReplaceCharFilter::process_pattern` replaces with only
   `_replacement[0]`, which silently truncates that sequence to its first byte.
   Please validate the encoded byte length instead, or restrict the replacement
   to ASCII, and add a regression case for a value like `é`.
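
   A minimal sketch of the byte-length check suggested above, for illustration
   only. The class `ReplacementCheck` and the helper `isSingleUtf8Byte` are
   hypothetical names and are not part of this PR or of `InvertedIndexUtil`;
   the sketch assumes the replacement reaches BE encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCheck {
    // Hypothetical helper (not from the PR): true only when the value encodes
    // to exactly one UTF-8 byte, which is the case only for a single ASCII char.
    static boolean isSingleUtf8Byte(String value) {
        return value != null
                && value.getBytes(StandardCharsets.UTF_8).length == 1;
    }

    public static void main(String[] args) {
        String ascii = "_";
        String eAcute = "\u00e9"; // 'é'

        // Both strings are one UTF-16 code unit long, so length() cannot
        // tell them apart.
        System.out.println(ascii.length() + " " + eAcute.length());

        // The UTF-8 encodings differ: one byte versus two bytes.
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length
                + " " + eAcute.getBytes(StandardCharsets.UTF_8).length);

        // The byte-length check accepts '_' and rejects 'é'.
        System.out.println(isSingleUtf8Byte(ascii) + " " + isSingleUtf8Byte(eAcute));
    }
}
```

   A check of this shape would reject `é` on the FE side, before BE ever reads
   only the first byte of the encoded replacement. Restricting the value to
   ASCII, as the comment also proposes, has the same effect.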



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

