Re: [PR] [fix](fe) Reject invalid char filter replacement in tokenize [doris]

via GitHub Fri, 26 Jun 2026 04:27:58 -0700


airborne12 commented on code in PR #64794:
URL: https://github.com/apache/doris/pull/64794#discussion_r3481073394



##########
fe/fe-core/src/main/java/org/apache/doris/analysis/InvertedIndexUtil.java:
##########
@@ -134,15 +134,43 @@ public static void checkInvertedIndexParser(String indexColName, PrimitiveType c
         }
     }
 
-    private static boolean isSingleByte(String str) {
+    private static boolean isAscii(String str) {
         for (int i = 0; i < str.length(); i++) {
-            if (str.charAt(i) > 0xFF) {
+            if (str.charAt(i) > 0x7F) {
                 return false;
             }
         }
         return true;
     }
 
+    public static void checkCharFilterProperties(Map<String, String> properties) throws AnalysisException {

Review Comment:
   Blocking: this helper tightens validation for the legacy `char_filter_*` 
properties and the `TOKENIZE()` path that reuses them, but the 
index-policy/custom-analyzer path still validates independently through 
`CharReplaceCharFilterValidator`. That validator currently accepts any value 
satisfying `replacement.length() == 1 && replacement.charAt(0) <= 255`, so 
`replacement="é"` (U+00E9, a single Java `char` but two bytes in UTF-8) passes 
FE validation. The BE `CharReplaceCharFilterFactory` checks the UTF-8 byte 
length and rejects the same value because `_replacement.size() != 1`, so custom 
analyzers can still create the FE/BE mismatch this PR is trying to eliminate. 
Please make the policy validator reuse the same validation logic, mapping the 
policy's `pattern`/`replacement` to these fields, and add a policy test for a 
non-ASCII replacement.
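
   To illustrate the mismatch, here is a minimal standalone sketch. It is not 
the Doris code: the class and method names are hypothetical, the first method 
only restates the condition quoted above, and the second approximates the BE 
byte-length check by encoding the string as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Doris.
public class ReplacementCheckSketch {

    // Restates the condition quoted in the review comment:
    // exactly one UTF-16 code unit whose value is at most 255.
    public static boolean passesCharCodeCheck(String replacement) {
        return replacement.length() == 1 && replacement.charAt(0) <= 255;
    }

    // Approximates a byte-length check: the value must encode to exactly
    // one byte in UTF-8, which is true only for ASCII characters.
    public static boolean passesUtf8ByteCheck(String replacement) {
        return replacement.getBytes(StandardCharsets.UTF_8).length == 1;
    }

    public static void main(String[] args) {
        String accented = "\u00e9"; // "é", U+00E9, encoded as 0xC3 0xA9 in UTF-8

        // The two checks disagree for "é": this is the FE/BE gap.
        System.out.println(passesCharCodeCheck(accented)); // true
        System.out.println(passesUtf8ByteCheck(accented)); // false

        // They agree for a plain ASCII replacement.
        System.out.println(passesCharCodeCheck(" ")); // true
        System.out.println(passesUtf8ByteCheck(" ")); // true
    }
}
```

   Any character in the range U+0080 to U+00FF behaves like `é` here, which is 
why tightening the bound to `0x7F` closes the gap between the two checks.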



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
