hoshinojyunn opened a new pull request, #64794:
URL: https://github.com/apache/doris/pull/64794
Issue Number: None
Related PR: None
Problem Summary: Doris FE accepted empty or multi-character
char_filter_replacement values for inverted-index char_replace configuration.
The BE char_replace implementation only supports replacing with a single byte,
so invalid configurations could silently produce incorrect tokenize results.
This change centralizes FE char filter validation, requires
char_filter_replacement to be a single non-empty character when specified, and
reuses the same validation for tokenize() analysis and inverted index property
checks.
Reject empty or multi-character char_filter_replacement values in tokenize()
and inverted index property validation.
- Test: FE unit test
- ./run-fe-ut.sh --run
org.apache.doris.analysis.InvertedIndexPropertiesTest,org.apache.doris.nereids.trees.expressions.functions.scalar.TokenizeTest
- Regression test added but not run because
regression-test/conf/regression-conf.groovy currently points to a
user-configured external cluster
- Behavior changed: Yes (invalid char_filter_replacement is now rejected
during FE analysis/validation)
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
FE previously accepted invalid `char_filter_replacement` values in
inverted-index `char_replace` configuration and `tokenize()` properties. This
could pass analysis successfully but produce incorrect results in BE, because
the BE `char_replace` implementation only supports a single-byte
replacement.
Two concrete examples are:
1. Multi-character replacement:
```sql
SELECT tokenize(
'a.b.c',
'"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"="xyz"'
);
```
Before this change, FE accepted the input, but the actual result was:
[{"token":"axbxc"}]
while the intuitive expected behavior for replacing . with xyz would be:
[{"token":"axyzbxyzc"}]
2. Empty replacement:
```sql
SELECT tokenize(
'a.b.c',
'"parser"="english","char_filter_type"="char_replace","char_filter_pattern"=".","char_filter_replacement"=""'
);
```
Before this change, FE also accepted the input, but the actual result was:
[{"token":"a"},{"token":"b"},{"token":"c"}]
while the expected result for removing . would be:
[{"token":"abc"}]
The root cause is that FE did not validate char_filter_replacement strictly
enough, while BE only handles single-byte replacement correctly. This PR fixes
the issue by centralizing char filter validation in FE and reusing it from both
inverted index property validation and
Tokenize.checkLegalityBeforeTypeCoercion().
After this change, FE rejects char_filter_replacement unless it is a single
non-empty character, preventing these invalid configurations from reaching BE.
This PR also adds FE unit tests to cover:
- the wrapped exception path in Tokenize.checkLegalityBeforeTypeCoercion()
- every branch and every exception path in
InvertedIndexUtil.checkCharFilterProperties()
### Release note
tokenize() and inverted index property validation now reject empty or
multi-character char_filter_replacement values during FE analysis.
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should
merge into -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]