[PR] [fix](fe) Reject invalid char filter replacement in tokenize [doris]

via GitHub Wed, 24 Jun 2026 04:41:25 -0700


hoshinojyunn opened a new pull request, #64794:
URL: https://github.com/apache/doris/pull/64794


   Issue Number: None
   
   Related PR: None
   
   Problem Summary: Doris FE accepted empty or multi-character 
char_filter_replacement values for inverted-index char_replace configuration. 
The BE char_replace implementation only supports replacing with a single byte, 
so invalid configurations could silently produce incorrect tokenize results. 
This change centralizes FE char filter validation, requires 
char_filter_replacement to be a single non-empty character when specified, and 
reuses the same validation for tokenize() analysis and inverted index property 
checks.
   
   Reject empty or multi-character char_filter_replacement values in tokenize() 
and inverted index property validation.
   
   - Test: FE unit test
       - ./run-fe-ut.sh --run 
org.apache.doris.analysis.InvertedIndexPropertiesTest,org.apache.doris.nereids.trees.expressions.functions.scalar.TokenizeTest
       - Regression test added but not run because 
regression-test/conf/regression-conf.groovy currently points to a 
user-configured external cluster
   - Behavior changed: Yes (invalid char_filter_replacement is now rejected 
during FE analysis/validation)
   - Does this need documentation: No
   
   ### What problem does this PR solve?
   
   Issue Number: close #xxx
   
   Related PR: #xxx
   
   Problem Summary:
   FE previously accepted invalid `char_filter_replacement` values in 
inverted-index `char_replace` configuration and `tokenize()` properties. This 
could pass analysis successfully but produce incorrect results in BE, because 
the BE `char_replace` implementation only supports a single-byte
   replacement.
   
   Two concrete examples are:
   
   1. Multi-character replacement:
    ```sql
    SELECT tokenize(
        'a.b.c',
        
'"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"="xyz"'
    );
   ```
   Before this change, FE accepted the input, but the actual result was:
   
   [{"token":"axbxc"}]
   
   while the intuitive expected behavior for replacing . with xyz would be:
   
   [{"token":"axyzbxyzc"}]
   
   2. Empty replacement:
   ```sql
    SELECT tokenize(
        'a.b.c',
        
'"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"=""'
    );
   ```
    Before this change, FE also accepted the input, but the actual result was:
   
    [{"token":"a"},{"token":"b"},{"token":"c"}]
    while the expected result for removing . would be:
   
    [{"token":"abc"}]
   
   The root cause is that FE did not validate char_filter_replacement strictly 
enough, while BE only handles single-byte replacement correctly. This PR fixes 
the issue by centralizing char filter validation in FE and reusing it from both 
inverted index property validation and
   Tokenize.checkLegalityBeforeTypeCoercion().
   
   After this change, FE rejects char_filter_replacement unless it is a single 
non-empty character, preventing these invalid configurations from reaching BE.
   
   This PR also adds FE unit tests to cover:
   
   - the wrapped exception path in Tokenize.checkLegalityBeforeTypeCoercion()
   - every branch and every exception path in 
InvertedIndexUtil.checkCharFilterProperties()
   
   ### Release note
   
   tokenize() and inverted index property validation now reject empty or 
multi-character char_filter_replacement values during FE analysis.
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [fix](fe) Reject invalid char filter replacement in tokenize [doris]

Reply via email to