This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 35bc68d4ffe add custom-normalizer and unicode_normalize (#3114)
35bc68d4ffe is described below

commit 35bc68d4ffe144228489c8bb6607b16e297be947
Author: zzzxl <[email protected]>
AuthorDate: Wed Nov 26 09:40:55 2025 +0800

    add custom-normalizer and unicode_normalize (#3114)
    
    ## Versions
    
    - [x] dev
    - [ ] 4.x
    - [ ] 3.x
    - [ ] 2.1
    
    ## Languages
    
    - [x] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [ ] Checked by AI
    - [ ] Test Cases Built
---
 docs/ai/text-search/custom-analyzer.md             |   8 ++
 docs/ai/text-search/custom-normalizer.md           | 107 ++++++++++++++++
 .../string-functions/unicode_normalize.md          | 136 +++++++++++++++++++++
 .../current/ai/text-search/custom-analyzer.md      |   8 ++
 .../current/ai/text-search/custom-normalizer.md    | 107 ++++++++++++++++
 .../string-functions/unicode_normalize.md          | 136 +++++++++++++++++++++
 sidebars.ts                                        |   2 +
 7 files changed, 504 insertions(+)

diff --git a/docs/ai/text-search/custom-analyzer.md b/docs/ai/text-search/custom-analyzer.md
index f5fd87820fb..39523b41c7e 100644
--- a/docs/ai/text-search/custom-analyzer.md
+++ b/docs/ai/text-search/custom-analyzer.md
@@ -29,6 +29,11 @@ PROPERTIES (
 - Parameters
   - `char_filter_pattern`: characters to replace
   - `char_filter_replacement`: replacement characters (default: space)
+`icu_normalizer`: Preprocesses text using ICU normalization.
+- Parameters
+  - `name`: Normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
+  - `mode`: Normalization mode (default `compose`). Options: `compose`, `decompose`
+  - `unicode_set_filter`: Specify the character set to normalize (e.g. `[a-z]`)
 
 #### 2. Creating a tokenizer
 
@@ -77,6 +82,9 @@ Available token filters:
 - **ascii_folding**: Converts non-ASCII characters to ASCII equivalents
 - **lowercase**: Converts tokens to lowercase
 - **pinyin**: Converts Chinese characters to pinyin after tokenization. For parameter details, refer to the **pinyin** tokenizer above.
+- **icu_normalizer**: Processes tokens using ICU normalization.
+  - `name`: Normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
+  - `unicode_set_filter`: Specify the character set to normalize
 
 #### 4. Creating an analyzer
 
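[Editor's note: a minimal sketch of the new icu_normalizer parameters in use — not part of this commit. It assumes the `CREATE INVERTED INDEX CHAR_FILTER` / `TOKEN_FILTER` statements described in the custom-analyzer doc; the filter names are illustrative.]

```sql
-- Hypothetical sketch: icu_normalizer as a char filter (runs before
-- tokenization) and as a token filter (runs on each token afterwards).
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS my_icu_char_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf",               -- normalization form
    "mode" = "compose",               -- compose or decompose
    "unicode_set_filter" = "[a-z]"    -- restrict normalization to this set
);

CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_icu_token_filter
PROPERTIES
(
    "type" = "icu_normalizer",
    "name" = "nfkc_cf"
);
```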
diff --git a/docs/ai/text-search/custom-normalizer.md b/docs/ai/text-search/custom-normalizer.md
new file mode 100644
index 00000000000..985b1853c64
--- /dev/null
+++ b/docs/ai/text-search/custom-normalizer.md
@@ -0,0 +1,107 @@
+---
+{
+    "title": "Custom Normalizer",
+    "language": "en"
+}
+---
+
+## Overview
+
+Custom Normalizer is used for unified text preprocessing, typically in scenarios that need normalization but not tokenization (such as keyword search). Unlike an Analyzer, a Normalizer does not split the text; it processes the entire input as a single Token. Character filters and token filters can be combined to implement functions such as case conversion and character normalization.
+
+## Using Custom Normalizer
+
+### Create
+
+A custom normalizer consists mainly of character filters (`char_filter`) and token filters (`token_filter`).
+
+> Note: For detailed instructions on creating `char_filter` and `token_filter`, please refer to the [Custom Analyzer] documentation.
+
+```sql
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS x_normalizer
+PROPERTIES (
+  "char_filter" = "x_char_filter",          -- Optional, one or more character filters
+  "token_filter" = "x_filter1, x_filter2"   -- Optional, one or more token filters, executed in order
+);
+```
+
+### View
+
+```sql
+SHOW INVERTED INDEX NORMALIZER;
+```
+
+### Drop
+
+```sql
+DROP INVERTED INDEX NORMALIZER IF EXISTS x_normalizer;
+```
+
+## Usage in Table Creation
+
+Specify the custom normalizer using `normalizer` in the inverted index properties.
+
+**Note**: `normalizer` and `analyzer` are mutually exclusive and cannot be specified in the same index simultaneously.
+
+```sql
+CREATE TABLE tbl (
+    `id` bigint NOT NULL,
+    `code` text NULL,
+    INDEX idx_code (`code`) USING INVERTED PROPERTIES("normalizer" = "x_custom_normalizer")
+)
+...
+```
+
+## Limitations
+
+1. The names referenced in `char_filter` and `token_filter` must exist (either built-in or already created).
+2. A normalizer can only be dropped if no table is using it.
+3. A `char_filter` or `token_filter` can only be dropped if no normalizer is using it.
+4. A newly created normalizer takes about 10 seconds to sync to the BE nodes; imports that reference it succeed without errors once the sync completes.
+
+## Complete Example
+
+### Example: Ignoring Case and Special Accents
+
+This example demonstrates how to create a normalizer that converts text to lowercase and removes accents (e.g., normalizing `Café` to `cafe`), suitable for case-insensitive and accent-insensitive exact matching.
+
+```sql
+-- 1. Create a custom token filter (if specific parameters are needed)
+-- Create an ascii_folding filter here
+CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_ascii_folding
+PROPERTIES
+(
+    "type" = "ascii_folding",
+    "preserve_original" = "false"
+);
+
+-- 2. Create the normalizer
+-- Combine lowercase (built-in) and my_ascii_folding
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS lowercase_ascii_normalizer
+PROPERTIES
+(
+    "token_filter" = "lowercase, my_ascii_folding"
+);
+
+-- 3. Use in table creation
+CREATE TABLE product_table (
+    `id` bigint NOT NULL,
+    `product_name` text NULL,
+    INDEX idx_name (`product_name`) USING INVERTED PROPERTIES("normalizer" = "lowercase_ascii_normalizer")
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY RANDOM BUCKETS 1
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 1"
+);
+
+-- 4. Verify and test
+select tokenize('Café-Products', '"normalizer"="lowercase_ascii_normalizer"');
+```
+
+Result:
+```json
+[
+  {"token":"cafe-products"}
+]
+```
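[Editor's note: with the index in place, a hypothetical lookup — not part of this commit. It assumes the normalizer is also applied to the query term, as the tokenize() output above suggests.]

```sql
-- Hypothetical sketch: a differently-cased, accented query term should
-- still hit the single normalized term 'cafe-products'.
SELECT id, product_name
FROM product_table
WHERE product_name MATCH 'Café-Products';
```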
diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md
new file mode 100644
index 00000000000..8e39b826a91
--- /dev/null
+++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md
@@ -0,0 +1,136 @@
+---
+{
+    "title": "UNICODE_NORMALIZE",
+    "language": "en"
+}
+---
+
+## Description
+
+Performs [Unicode Normalization](https://unicode-org.github.io/icu/userguide/transforms/normalization/) on the input string.
+
+Unicode normalization is the process of converting equivalent Unicode character sequences into a unified form. For example, the character "é" can be represented by a single code point (U+00E9) or by "e" plus a combining acute accent (U+0065 + U+0301). Normalization ensures that these equivalent representations are handled uniformly.
+
+## Syntax
+
+```sql
+UNICODE_NORMALIZE(<str>, <mode>)
+```
+
+## Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `<str>` | The input string to be normalized. Type: VARCHAR |
+| `<mode>` | The normalization mode, must be a constant string (case-insensitive). Supported modes:<br/>- `NFC`: Canonical Decomposition, followed by Canonical Composition<br/>- `NFD`: Canonical Decomposition<br/>- `NFKC`: Compatibility Decomposition, followed by Canonical Composition<br/>- `NFKD`: Compatibility Decomposition<br/>- `NFKC_CF`: NFKC followed by Case Folding |
+
+## Return Value
+
+Returns VARCHAR type, representing the normalized result of the input string.
+
+## Examples
+
+1. Difference between NFC and NFD (composed vs decomposed characters)
+
+```sql
+-- In 'Café', é may be in composed form; NFD decomposes it into e + a combining accent
+SELECT length(unicode_normalize('Café', 'NFC')) AS nfc_len, length(unicode_normalize('Café', 'NFD')) AS nfd_len;
+```
+
+```text
++---------+---------+
+| nfc_len | nfd_len |
++---------+---------+
+|       4 |       5 |
++---------+---------+
+```
+
+2. NFKC_CF for case folding
+
+```sql
+SELECT unicode_normalize('ABC 123', 'nfkc_cf') AS result;
+```
+
+```text
++---------+
+| result  |
++---------+
+| abc 123 |
++---------+
+```
+
+3. NFKC handling fullwidth characters (compatibility decomposition)
+
+```sql
+-- Fullwidth digits '１２３' will be converted to halfwidth '123'
+SELECT unicode_normalize('１２３ＡＢＣ', 'NFKC') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 123ABC |
++--------+
+```
+
+4. NFKD handling special symbols (compatibility decomposition)
+
+```sql
+-- ℃ (degree Celsius symbol) will be decomposed to °C
+SELECT unicode_normalize('25℃', 'NFKD') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 25°C   |
++--------+
+```
+
+5. Handling circled numbers
+
+```sql
+-- ① ② ③ circled numbers will be converted to regular digits
+SELECT unicode_normalize('①②③', 'NFKC') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 123    |
++--------+
+```
+
+6. Comparing different modes on the same string
+
+```sql
+SELECT 
+    unicode_normalize('ﬁ', 'NFC') AS nfc_result,
+    unicode_normalize('ﬁ', 'NFKC') AS nfkc_result;
+```
+
+```text
++------------+-------------+
+| nfc_result | nfkc_result |
++------------+-------------+
+| ﬁ          | fi          |
++------------+-------------+
+```
+
+7. String equality comparison scenario
+
+```sql
+-- Use normalization to compare visually identical but differently encoded strings
+SELECT unicode_normalize('café', 'NFC') = unicode_normalize('café', 'NFC') AS is_equal;
+```
+
+```text
++----------+
+| is_equal |
++----------+
+|        1 |
++----------+
+```
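[Editor's note: a hedged sketch of the comparison scenario in a query — hypothetical table and column names, not part of this commit. Normalizing both sides of a WHERE predicate makes the match robust to composed/decomposed encodings, fullwidth forms, and case.]

```sql
-- Hypothetical sketch: match user input against stored names regardless of
-- Unicode encoding, width variants, or letter case.
SELECT name
FROM products   -- assumed table, for illustration only
WHERE unicode_normalize(name, 'NFKC_CF') = unicode_normalize('CAFÉ', 'NFKC_CF');
```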
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
index 69b5749dd19..aff791eb211 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
@@ -29,6 +29,11 @@ PROPERTIES (
 - Parameters
   - `char_filter_pattern`: the list of characters to replace
   - `char_filter_replacement`: the replacement character (default: space)
+`icu_normalizer`: Preprocesses text using ICU normalization.
+- Parameters
+  - `name`: Normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
+  - `mode`: Normalization mode (default `compose`). Options: `compose`, `decompose`
+  - `unicode_set_filter`: The character set to normalize (e.g. `[a-z]`)
 
 #### 2. Tokenizer
 
@@ -81,6 +86,9 @@ PROPERTIES (
     - `type_table`: Custom character-type mapping (e.g. `[+ => ALPHA, - => ALPHA]`); types include `ALPHA`, `ALPHANUM`, `DIGIT`, `LOWER`, `SUBWORD_DELIM`, `UPPER`
 - `ascii_folding`: Maps non-ASCII characters to their ASCII equivalents
 - `lowercase`: Converts token text to lowercase
+- `icu_normalizer`: Processes tokens using ICU normalization.
+  - `name`: Normalization form (default `nfkc_cf`). Options: `nfc`, `nfkc`, `nfkc_cf`, `nfd`, `nfkd`
+  - `unicode_set_filter`: The character set to normalize
 
 #### 4. Analyzer
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-normalizer.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-normalizer.md
new file mode 100644
index 00000000000..22ad6166476
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-normalizer.md
@@ -0,0 +1,107 @@
+---
+{
+    "title": "Custom Normalizer",
+    "language": "zh-CN"
+}
+---
+
+## Overview
+
+A custom normalizer is used for unified text preprocessing, typically in scenarios that need normalization but not tokenization (such as keyword search). Unlike an Analyzer, a Normalizer does not split the text; it processes the entire input as a single Token. Character filters and token filters can be combined to implement functions such as case conversion and character normalization.
+
+## Using Custom Normalizer
+
+### Create
+
+A custom normalizer consists mainly of character filters (`char_filter`) and token filters (`token_filter`).
+
+> Note: For detailed instructions on creating `char_filter` and `token_filter`, please refer to the [Custom Analyzer] documentation.
+
+```sql
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS x_normalizer
+PROPERTIES (
+  "char_filter" = "x_char_filter",          -- Optional, one or more character filters
+  "token_filter" = "x_filter1, x_filter2"   -- Optional, one or more token filters, executed in order
+);
+```
+
+### View
+
+```sql
+SHOW INVERTED INDEX NORMALIZER;
+```
+
+### Drop
+
+```sql
+DROP INVERTED INDEX NORMALIZER IF EXISTS x_normalizer;
+```
+
+## Usage in Table Creation
+
+Specify the custom normalizer using `normalizer` in the inverted index properties.
+
+**Note**: `normalizer` and `analyzer` are mutually exclusive and cannot be specified in the same index simultaneously.
+
+```sql
+CREATE TABLE tbl (
+    `id` bigint NOT NULL,
+    `code` text NULL,
+    INDEX idx_code (`code`) USING INVERTED PROPERTIES("normalizer" = "x_custom_normalizer")
+)
+...
+```
+
+## Limitations
+
+1. The names referenced in `char_filter` and `token_filter` must exist (either built-in or already created).
+2. A normalizer can only be dropped if no table is using it.
+3. A `char_filter` or `token_filter` can only be dropped if no normalizer is using it.
+4. A newly created normalizer takes about 10 seconds to sync to the BE nodes; imports that reference it succeed without errors once the sync completes.
+
+## Complete Example
+
+### Example: Ignoring Case and Special Accents
+
+This example demonstrates how to create a normalizer that converts text to lowercase and removes accents (e.g., normalizing `Café` to `cafe`), suitable for case-insensitive and accent-insensitive exact matching.
+
+```sql
+-- 1. Create a custom token filter (if specific parameters are needed)
+-- Here, create an ascii_folding filter
+CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS my_ascii_folding
+PROPERTIES
+(
+    "type" = "ascii_folding",
+    "preserve_original" = "false"
+);
+
+-- 2. Create the normalizer
+-- Combine lowercase (built-in) and my_ascii_folding
+CREATE INVERTED INDEX NORMALIZER IF NOT EXISTS lowercase_ascii_normalizer
+PROPERTIES
+(
+    "token_filter" = "lowercase, my_ascii_folding"
+);
+
+-- 3. Use in table creation
+CREATE TABLE product_table (
+    `id` bigint NOT NULL,
+    `product_name` text NULL,
+    INDEX idx_name (`product_name`) USING INVERTED PROPERTIES("normalizer" = "lowercase_ascii_normalizer")
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY RANDOM BUCKETS 1
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 1"
+);
+
+-- 4. Verify and test
+select tokenize('Café-Products', '"normalizer"="lowercase_ascii_normalizer"');
+```
+
+Result:
+```json
+[
+  {"token":"cafe-products"}
+]
+```
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md
new file mode 100644
index 00000000000..e712b115f2d
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize.md
@@ -0,0 +1,136 @@
+---
+{
+    "title": "UNICODE_NORMALIZE",
+    "language": "zh-CN"
+}
+---
+
+## Description
+
+Performs [Unicode Normalization](https://unicode-org.github.io/icu/userguide/transforms/normalization/) on the input string.
+
+Unicode normalization is the process of converting equivalent Unicode character sequences into a unified form. For example, the character "é" can be represented by a single code point (U+00E9) or by "e" plus a combining acute accent (U+0065 + U+0301). Normalization ensures that these equivalent representations are handled uniformly.
+
+## Syntax
+
+```sql
+UNICODE_NORMALIZE(<str>, <mode>)
+```
+
+## Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `<str>` | The input string to be normalized. Type: VARCHAR |
+| `<mode>` | The normalization mode, must be a constant string (case-insensitive). Supported modes:<br/>- `NFC`: Canonical Decomposition, followed by Canonical Composition<br/>- `NFD`: Canonical Decomposition<br/>- `NFKC`: Compatibility Decomposition, followed by Canonical Composition<br/>- `NFKD`: Compatibility Decomposition<br/>- `NFKC_CF`: NFKC followed by Case Folding |
+
+## Return Value
+
+Returns VARCHAR type, representing the normalized result of the input string.
+
+## Examples
+
+1. Difference between NFC and NFD (composed vs decomposed characters)
+
+```sql
+-- In 'Café', é may be in composed form; NFD decomposes it into e + a combining accent
+SELECT length(unicode_normalize('Café', 'NFC')) AS nfc_len, length(unicode_normalize('Café', 'NFD')) AS nfd_len;
+```
+
+```text
++---------+---------+
+| nfc_len | nfd_len |
++---------+---------+
+|       4 |       5 |
++---------+---------+
+```
+
+2. NFKC_CF for case folding
+
+```sql
+SELECT unicode_normalize('ABC 123', 'nfkc_cf') AS result;
+```
+
+```text
++---------+
+| result  |
++---------+
+| abc 123 |
++---------+
+```
+
+3. NFKC handling fullwidth characters (compatibility decomposition)
+
+```sql
+-- Fullwidth digits '１２３' will be converted to halfwidth '123'
+SELECT unicode_normalize('１２３ＡＢＣ', 'NFKC') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 123ABC |
++--------+
+```
+
+4. NFKD handling special symbols (compatibility decomposition)
+
+```sql
+-- ℃ (degree Celsius symbol) will be decomposed to °C
+SELECT unicode_normalize('25℃', 'NFKD') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 25°C   |
++--------+
+```
+
+5. Handling circled numbers
+
+```sql
+-- Circled numbers such as ① ② ③ will be converted to regular digits
+SELECT unicode_normalize('①②③', 'NFKC') AS result;
+```
+
+```text
++--------+
+| result |
++--------+
+| 123    |
++--------+
+```
+
+6. Comparing different modes on the same string
+
+```sql
+SELECT 
+    unicode_normalize('ﬁ', 'NFC') AS nfc_result,
+    unicode_normalize('ﬁ', 'NFKC') AS nfkc_result;
+```
+
+```text
++------------+-------------+
+| nfc_result | nfkc_result |
++------------+-------------+
+| ﬁ          | fi          |
++------------+-------------+
+```
+
+7. String equality comparison scenario
+
+```sql
+-- Use normalization to compare visually identical but differently encoded strings
+SELECT unicode_normalize('café', 'NFC') = unicode_normalize('café', 'NFC') AS is_equal;
+```
+
+```text
++----------+
+| is_equal |
++----------+
+|        1 |
++----------+
+```
diff --git a/sidebars.ts b/sidebars.ts
index bb23ee3a7e1..0cc1a526917 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -400,6 +400,7 @@ const sidebars: SidebarsConfig = {
                                 'ai/text-search/search-operators',
                                 'ai/text-search/search-function',
                                 'ai/text-search/custom-analyzer',
+                                'ai/text-search/custom-normalizer',
                                 'ai/text-search/scoring',
                             ],
                         },
@@ -1338,6 +1339,7 @@ const sidebars: SidebarsConfig = {
                                         'sql-manual/sql-functions/scalar-functions/string-functions/uncompress',
                                         'sql-manual/sql-functions/scalar-functions/string-functions/unhex',
                                         'sql-manual/sql-functions/scalar-functions/string-functions/ucase',
+                                        'sql-manual/sql-functions/scalar-functions/string-functions/unicode_normalize',
                                         'sql-manual/sql-functions/scalar-functions/string-functions/url-decode',
                                         'sql-manual/sql-functions/scalar-functions/string-functions/url-encode',
                                         'sql-manual/sql-functions/scalar-functions/string-functions/uuid',


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
