This is an automated email from the ASF dual-hosted git repository.

airborne pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 5c19057ef35 [update](tokenize) add tokenize function (#3103)
5c19057ef35 is described below

commit 5c19057ef35f05a2c2b292585d4f169c09de9315
Author: Jack <[email protected]>
AuthorDate: Mon Nov 17 18:10:20 2025 +0800

    [update](tokenize) add tokenize function (#3103)
    
    ## Versions
    
    - [x] dev
    - [x] 4.x
    - [x] 3.x
    - [x] 2.1
    
    ## Languages
    
    - [x] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [ ] Checked by AI
    - [ ] Test Cases Built
---
 .../scalar-functions/string-functions/tokenize.md  | 162 ++++++++++++
 .../scalar-functions/string-functions/tokenize.md  | 162 ++++++++++++
 .../scalar-functions/string-functions/tokenize.md  | 117 +++++++++
 .../scalar-functions/string-functions/tokenize.md  | 284 +++++++++++++++++++++
 .../scalar-functions/string-functions/tokenize.md  | 162 ++++++++++++
 .../scalar-functions/string-functions/tokenize.md  | 117 +++++++++
 .../scalar-functions/string-functions/tokenize.md  | 284 +++++++++++++++++++++
 .../scalar-functions/string-functions/tokenize.md  | 162 ++++++++++++
 8 files changed, 1450 insertions(+)

diff --git 
a/docs/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md 
b/docs/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
index 36a148b499a..7dda53e8101 100644
--- 
a/docs/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
+++ 
b/docs/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
@@ -5,3 +5,165 @@
 }
 ---
 
+## Description
+
+The `TOKENIZE` function tokenizes a string with the specified analyzer and returns the tokenization result as a JSON array of token objects, serialized as a string. This function is particularly useful for understanding how text will be analyzed when using inverted indexes with full-text search capabilities.
+
+## Syntax
+
+```sql
+VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+## Parameters
+
+- `str`: The input string to be tokenized. Type: `VARCHAR`
+- `properties`: A property string specifying the analyzer configuration. Type: 
`VARCHAR`
+
+The `properties` parameter supports the following key-value pairs (format: 
`"key1"="value1", "key2"="value2"`):
+
+### Common Properties
+
+| Property | Description | Example Values |
+|----------|-------------|----------------|
+| `built_in_analyzer` | Built-in analyzer type | `"english"`, `"chinese"`, `"unicode"`, `"icu"`, `"basic"`, `"ik"`, `"standard"`, `"none"` |
+| `analyzer` | Custom analyzer name (created via `CREATE INVERTED INDEX ANALYZER`) | `"my_custom_analyzer"` |
+| `parser_mode` | Parser mode (for chinese analyzers) | `"fine_grained"`, `"coarse_grained"` |
+| `support_phrase` | Enable phrase support (stores position information) | `"true"`, `"false"` |
+| `lower_case` | Convert tokens to lowercase | `"true"`, `"false"` |
+| `char_filter_type` | Character filter type | Varies by filter |
+| `stop_words` | Stop words configuration | Varies by implementation |
+
+## Return Value
+
+Returns a `VARCHAR` containing a JSON array of tokenization results. Each 
element in the array is an object with the following structure:
+
+- `token`: The tokenized term
+- `position`: (Optional) The position index of the token when `support_phrase` 
is enabled
+
+## Examples
+
+### Example 1: Using built-in analyzers
+
+```sql
+-- Using the standard analyzer
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }]
+```
+
+```sql
+-- Using the english analyzer
+SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
+```
+```
+[{ "token": "run" }, { "token": "quick" }]
+```
+
+```sql
+-- Using the unicode analyzer with Chinese text
+SELECT TOKENIZE("Apache Doris数据库", '"built_in_analyzer"="unicode"');
+```
+```
+[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" 
}, { "token": "库" }]
+```
+
+```sql
+-- Using the chinese analyzer
+SELECT TOKENIZE("我来到北京清华大学", '"built_in_analyzer"="chinese"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
+```
+
+```sql
+-- Using the icu analyzer for multilingual text
+SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }, {"token": "世界"}]
+```
+
+```sql
+-- Using the basic analyzer
+SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", 
'"built_in_analyzer"="basic"');
+```
+```
+[{ "token": "get" }, { "token": "images" }, {"token": "hm"}, {"token": "bg"}, 
{"token": "jpg"}, {"token": "http"}, {"token": "1"}, {"token": "0"}]
+```
+
+```sql
+-- Using the ik analyzer for Chinese text
+SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
+```
+```
+[{ "token": "中华人民共和国" }, { "token": "国歌" }]
+```
+
+### Example 2: Using custom analyzers
+
+First, create a custom analyzer:
+
+```sql
+CREATE INVERTED INDEX ANALYZER lowercase_delimited
+PROPERTIES (
+    "tokenizer" = "standard",
+    "token_filter" = "asciifolding, lowercase"
+);
+```
+
+Then use it with `TOKENIZE`:
+
+```sql
+SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
+```
+```
+[{ "token": "foo" }, { "token": "bar" }]
+```
+
+### Example 3: With phrase support (position information)
+
+```sql
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", 
"support_phrase"="true"');
+```
+```
+[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
+```
+
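+### Example 4: Fine-grained mode for the chinese analyzer
+
+The `parser_mode` property from the table above can be combined with the `chinese` analyzer to switch to fine-grained segmentation. A minimal sketch; the exact tokens depend on the dictionary shipped with your build:
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', '"built_in_analyzer"="chinese", "parser_mode"="fine_grained"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华" }, { "token": "清华大学" }, { "token": "华大" }, { "token": "大学" }]
+```
+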
+## Notes
+
+1. **Analyzer Configuration**: The `properties` parameter must be a valid 
property string. If using a custom analyzer, it must be created beforehand 
using `CREATE INVERTED INDEX ANALYZER`.
+
+2. **Supported Analyzers**: Currently supported built-in analyzers include:
+   - `standard`: Standard analyzer for general text
+   - `english`: English language analyzer with stemming
+   - `chinese`: Chinese text analyzer
+   - `unicode`: Unicode-based analyzer for multilingual text
+   - `icu`: ICU-based analyzer for advanced Unicode processing
+   - `basic`: Basic tokenization
+   - `ik`: IK analyzer for Chinese text
+   - `none`: No tokenization (returns original string as single token)
+
+3. **Performance**: The `TOKENIZE` function is primarily intended for testing 
and debugging analyzer configurations. For production full-text search, use 
inverted indexes with the `MATCH` or `SEARCH` operators.
+
+4. **JSON Output**: The output is a formatted JSON string that can be further processed using JSON functions if needed (see the sketch after these notes).
+
+5. **Compatibility with Inverted Indexes**: The same analyzer configuration 
used in `TOKENIZE` can be applied to inverted indexes when creating tables:
+   ```sql
+   CREATE TABLE example (
+       content TEXT,
+       INDEX idx_content(content) USING INVERTED 
PROPERTIES("analyzer"="my_analyzer")
+   )
+   ```
+
+6. **Testing Analyzer Behavior**: Use `TOKENIZE` to preview how text will be 
tokenized before creating inverted indexes, helping to choose the most 
appropriate analyzer for your data.
+
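+As an illustration of note 4, the JSON result can be handed to Doris JSON functions. A minimal sketch, assuming `json_extract` and a MySQL-style path expression are available in your version:
+
+```sql
+-- Pull the first token out of the JSON array returned by TOKENIZE
+-- (the '$[0].token' path form is an assumption; adjust to your version's path syntax)
+SELECT json_extract(
+    TOKENIZE('Hello World', '"built_in_analyzer"="standard"'),
+    '$[0].token'
+) AS first_token;
+```
+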
+## Related Functions
+
+- 
[MATCH](../../../../sql-manual/basic-element/operators/conditional-operators/full-text-search-operators):
 Full-text search using inverted indexes
+- [SEARCH](../../../../ai/text-search/search-function): Advanced search with 
DSL support
+
+## Keywords
+
+TOKENIZE, STRING, FULL-TEXT SEARCH, INVERTED INDEX, ANALYZER
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
index 61e2285f0f6..a996a7890e6 100644
--- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
+++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
@@ -5,3 +5,165 @@
 }
 ---
 
+## 描述
+
+`TOKENIZE` 
函数使用指定的分词器对字符串进行分词,并以JSON格式的字符串数组返回分词结果。该函数特别适用于理解在使用倒排索引进行全文搜索时,文本将如何被分析处理。
+
+## 语法
+
+```sql
+VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+## 参数
+
+- `str`: 要进行分词的输入字符串,类型: `VARCHAR`
+- `properties`: 指定分词器配置的属性字符串,类型: `VARCHAR`
+
+`properties` 参数支持以下键值对(格式: `"key1"="value1", "key2"="value2"`):
+
+### 常用属性
+
+| 属性 | 描述 | 示例值 |
+|------|------|--------|
+| `built_in_analyzer` | 内置分词器类型 | `"english"`, `"chinese"`, `"unicode"`, `"icu"`, `"basic"`, `"ik"`, `"standard"`, `"none"` |
+| `analyzer` | 自定义分词器名称(通过 `CREATE INVERTED INDEX ANALYZER` 创建) | `"my_custom_analyzer"` |
+| `parser_mode` | 分词器模式(用于中文分词器) | `"fine_grained"`, `"coarse_grained"` |
+| `support_phrase` | 启用短语支持(存储位置信息) | `"true"`, `"false"` |
+| `lower_case` | 将词条转换为小写 | `"true"`, `"false"` |
+| `char_filter_type` | 字符过滤器类型 | 根据过滤器而异 |
+| `stop_words` | 停用词配置 | 根据实现而异 |
+
+## 返回值
+
+返回包含分词结果JSON数组的 `VARCHAR` 类型字符串。数组中的每个元素是一个对象,具有以下结构:
+
+- `token`: 分词后的词条
+- `position`: (可选)当启用 `support_phrase` 时,词条的位置索引
+
+## 示例
+
+### 示例 1: 使用内置分词器
+
+```sql
+-- 使用标准分词器
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }]
+```
+
+```sql
+-- 使用英语分词器
+SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
+```
+```
+[{ "token": "run" }, { "token": "quick" }]
+```
+
+```sql
+-- 使用unicode分词器处理中文文本
+SELECT TOKENIZE("Apache Doris数据库", '"built_in_analyzer"="unicode"');
+```
+```
+[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" 
}, { "token": "库" }]
+```
+
+```sql
+-- 使用中文分词器
+SELECT TOKENIZE("我来到北京清华大学", '"built_in_analyzer"="chinese"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
+```
+
+```sql
+-- 使用ICU分词器处理多语言文本
+SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }, {"token": "世界"}]
+```
+
+```sql
+-- 使用基础分词器
+SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", 
'"built_in_analyzer"="basic"');
+```
+```
+[{ "token": "get" }, { "token": "images" }, {"token": "hm"}, {"token": "bg"}, 
{"token": "jpg"}, {"token": "http"}, {"token": "1"}, {"token": "0"}]
+```
+
+```sql
+-- 使用IK分词器处理中文文本
+SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
+```
+```
+[{ "token": "中华人民共和国" }, { "token": "国歌" }]
+```
+
+### 示例 2: 使用自定义分词器
+
+首先创建一个自定义分词器:
+
+```sql
+CREATE INVERTED INDEX ANALYZER lowercase_delimited
+PROPERTIES (
+    "tokenizer" = "standard",
+    "token_filter" = "asciifolding, lowercase"
+);
+```
+
+然后在 `TOKENIZE` 中使用:
+
+```sql
+SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
+```
+```
+[{ "token": "foo" }, { "token": "bar" }]
+```
+
+### 示例 3: 启用短语支持(位置信息)
+
+```sql
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", 
"support_phrase"="true"');
+```
+```
+[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
+```
+
+## 注意事项
+
+1. **分词器配置**: `properties` 参数必须是有效的属性字符串。如果使用自定义分词器,必须先使用 `CREATE INVERTED 
INDEX ANALYZER` 创建。
+
+2. **支持的分词器**: 当前支持的内置分词器包括:
+   - `standard`: 标准分词器,用于通用文本
+   - `english`: 带词干提取的英语分词器
+   - `chinese`: 中文文本分词器
+   - `unicode`: 基于Unicode的多语言文本分词器
+   - `icu`: 基于ICU的高级Unicode处理分词器
+   - `basic`: 基础分词
+   - `ik`: 中文IK分词器
+   - `none`: 不分词(返回原始字符串作为单个词条)
+
+3. **性能考虑**: `TOKENIZE` 函数主要用于测试和调试分词器配置。在生产环境的全文搜索中,应使用带有 `MATCH` 或 `SEARCH` 
操作符的倒排索引。
+
+4. **JSON输出**: 输出是格式化的JSON字符串,如需进一步处理,可以使用JSON函数。
+
+5. **与倒排索引的兼容性**: 在 `TOKENIZE` 中使用的相同分词器配置可以应用于创建表时的倒排索引:
+   ```sql
+   CREATE TABLE example (
+       content TEXT,
+       INDEX idx_content(content) USING INVERTED 
PROPERTIES("analyzer"="my_analyzer")
+   )
+   ```
+
+6. **测试分词器行为**: 使用 `TOKENIZE` 可以在创建倒排索引之前预览文本的分词效果,有助于为您的数据选择最合适的分词器。
+
+## 相关函数
+
+- 
[MATCH](../../../../sql-manual/basic-element/operators/conditional-operators/full-text-search-operators):
 使用倒排索引进行全文搜索
+- [SEARCH](../../../../ai/text-search/search-function): 支持DSL的高级搜索
+
+## 关键字
+
+TOKENIZE, STRING, 全文搜索, 倒排索引, 分词器
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
index 61e2285f0f6..5f775ffc0c3 100644
--- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
+++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
@@ -5,3 +5,120 @@
 }
 ---
 
+## 描述
+
+`TOKENIZE` 
函数使用指定的分词器对字符串进行分词,并以字符串数组形式返回分词结果。该函数特别适用于测试和理解在使用倒排索引进行全文搜索时,文本将如何被分析处理。
+
+## 语法
+
+```sql
+ARRAY<VARCHAR> TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+## 参数
+
+- `str`: 要进行分词的输入字符串,类型: `VARCHAR`
+- `properties`: 指定分词器配置的属性字符串,类型: `VARCHAR`
+
+`properties` 参数支持以下键值对(格式: `'key1'='value1', 'key2'='value2'` 或 
`"key1"="value1", "key2"="value2"`):
+
+### 支持的属性
+
+| 属性 | 描述 | 示例值 |
+|------|------|--------|
+| `parser` | 内置分词器类型 | `"chinese"`, `"english"`, `"unicode"` |
+| `parser_mode` | 中文分词器的分词模式 | `"fine_grained"`, `"coarse_grained"` |
+| `char_filter_type` | 字符过滤器类型 | `"char_replace"` |
+| `char_filter_pattern` | 要替换的字符(与 `char_filter_type` 配合使用) | `"._=:,"` |
+| `char_filter_replacement` | 替换字符(与 `char_filter_type` 配合使用) | `" "` (空格) |
+| `stopwords` | 停用词配置 | `"none"` |
+
+## 返回值
+
+返回 `ARRAY<VARCHAR>` 类型,包含分词后的字符串数组。
+
+## 示例
+
+### 示例 1: 使用中文分词器
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese'");
+```
+```
+["我", "来到", "北京", "清华大学"]
+```
+
+### 示例 2: 中文分词器的细粒度模式
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese', 
'parser_mode'='fine_grained'");
+```
+```
+["我", "来到", "北京", "清华", "清华大学", "华大", "大学"]
+```
+
+### 示例 3: 使用 Unicode 分词器
+
+```sql
+SELECT TOKENIZE('Apache Doris数据库', "'parser'='unicode'");
+```
+```
+["apache", "doris", "数", "据", "库"]
+```
+
+### 示例 4: 使用字符过滤器
+
+```sql
+SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd',
+    '"parser"="unicode","char_filter_type" = 
"char_replace","char_filter_pattern" = "._=:,","char_filter_replacement" = " 
"');
+```
+```
+["get", "images", "hm", "bg", "jpg", "http", "1", "0", "test", "abc", "bcd"]
+```
+
+### 示例 5: 停用词配置
+
+```sql
+SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode"');
+```
+```
+["华", "夏", "智", "胜", "新", "税", "股", "票"]
+```
+
+```sql
+SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode","stopwords" = "none"');
+```
+```
+["华", "夏", "智", "胜", "新", "税", "股", "票", "a"]
+```
+
+## 注意事项
+
+1. **分词器配置**: `properties` 参数必须是有效的属性字符串。此版本仅支持内置分词器。
+
+2. **支持的分词器**: 2.1 版本支持以下内置分词器:
+   - `chinese`: 中文文本分词器,支持可选的 `parser_mode`(`fine_grained` 或 `coarse_grained`)
+   - `english`: 带词干提取的英语分词器
+   - `unicode`: 基于 Unicode 的多语言文本分词器
+
+3. **分词模式**: `parser_mode` 属性主要用于 `chinese` 分词器:
+   - `fine_grained`: 细粒度模式,生成更详细的词条,包含重叠片段
+   - `coarse_grained`: 粗粒度模式(默认),标准分词
+
+4. **字符过滤器**: 需要同时使用 `char_filter_type`、`char_filter_pattern` 和 
`char_filter_replacement` 来在分词前替换特定字符。
+
+5. **性能考虑**: `TOKENIZE` 函数主要用于测试和调试分词器配置。在生产环境的全文搜索中,应使用带有 `MATCH` 谓词的倒排索引。
+
+6. **与倒排索引的兼容性**: 在 `TOKENIZE` 中使用的相同分词器配置可以应用于创建表时的倒排索引:
+   ```sql
+   CREATE TABLE example (
+       content TEXT,
+       INDEX idx_content(content) USING INVERTED PROPERTIES("parser"="chinese")
+   )
+   ```
+
+7. **测试分词器行为**: 使用 `TOKENIZE` 可以在创建倒排索引之前预览文本的分词效果,有助于为您的数据选择最合适的分词器。
+
+## 关键字
+
+TOKENIZE, STRING, 全文搜索, 倒排索引, 分词器
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
index 61e2285f0f6..10023db5d13 100644
--- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
+++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
@@ -5,3 +5,287 @@
 }
 ---
 
+## 描述
+
+`TOKENIZE` 函数使用指定的分词器对字符串进行分词,并返回分词结果。该函数特别适用于测试和理解在使用倒排索引进行全文搜索时,文本将如何被分析处理。
+
+:::tip 版本差异
+`TOKENIZE` 函数在 3.0 和 3.1+ 版本之间存在行为差异:
+- **3.0 版本**: 使用 `parser` 参数,返回简单字符串数组
+- **3.1+ 版本**: 支持 `built_in_analyzer` 和自定义 `analyzer`,返回 JSON 对象数组,功能更强大
+
+关于 3.0 版本的具体用法,请参见 [3.0 版本特性](#30-版本特性) 章节。
+:::
+
+---
+
+## 3.1+ 版本特性 (推荐)
+
+### 语法
+
+```sql
+VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+### 参数
+
+- `str`: 要进行分词的输入字符串,类型: `VARCHAR`
+- `properties`: 指定分词器配置的属性字符串,类型: `VARCHAR`
+
+`properties` 参数支持以下键值对(格式: `"key1"="value1", "key2"="value2"`):
+
+| 属性 | 描述 | 示例值 |
+|------|------|--------|
+| `built_in_analyzer` | 内置分词器类型 | `"standard"`, `"english"`, `"chinese"`, `"unicode"`, `"icu"`, `"basic"`, `"ik"`, `"none"` |
+| `analyzer` | 自定义分词器名称(通过 `CREATE INVERTED INDEX ANALYZER` 创建) | `"my_custom_analyzer"` |
+| `parser` | 内置分词器类型(向后兼容) | `"chinese"`, `"english"`, `"unicode"` |
+| `parser_mode` | 中文分词器的分词模式 | `"fine_grained"`, `"coarse_grained"` |
+| `support_phrase` | 启用短语支持(存储位置信息) | `"true"`, `"false"` |
+| `lower_case` | 将词条转换为小写 | `"true"`, `"false"` |
+| `char_filter_type` | 字符过滤器类型 | `"char_replace"` |
+| `char_filter_pattern` | 要替换的字符(与 `char_filter_type` 配合使用) | `"._=:,"` |
+| `char_filter_replacement` | 替换字符(与 `char_filter_type` 配合使用) | `" "` (空格) |
+| `stopwords` | 停用词配置 | `"none"` |
+
+### 返回值
+
+返回包含分词结果 JSON 数组的 `VARCHAR` 类型字符串。数组中的每个元素是一个对象,具有以下结构:
+- `token`: 分词后的词条
+- `position`: (可选)当启用 `support_phrase` 时,词条的位置索引
+
+### 示例
+
+#### 示例 1: 使用内置分词器
+
+```sql
+-- 标准分词器
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }]
+```
+
+```sql
+-- 英语分词器(带词干提取)
+SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
+```
+```
+[{ "token": "run" }, { "token": "quick" }]
+```
+
+```sql
+-- 中文分词器
+SELECT TOKENIZE('我来到北京清华大学', '"built_in_analyzer"="chinese"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
+```
+
+```sql
+-- Unicode 分词器
+SELECT TOKENIZE('Apache Doris数据库', '"built_in_analyzer"="unicode"');
+```
+```
+[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" 
}, { "token": "库" }]
+```
+
+```sql
+-- ICU 分词器处理多语言文本
+SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }, { "token": "世界" }]
+```
+
+```sql
+-- 基础分词器
+SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", 
'"built_in_analyzer"="basic"');
+```
+```
+[{ "token": "get" }, { "token": "images" }, { "token": "hm" }, { "token": "bg" 
}, { "token": "jpg" }, { "token": "http" }, { "token": "1" }, { "token": "0" }]
+```
+
+```sql
+-- IK 分词器处理中文文本
+SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
+```
+```
+[{ "token": "中华人民共和国" }, { "token": "国歌" }]
+```
+
+#### 示例 2: 中文分词器的细粒度模式
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', '"built_in_analyzer"="chinese", 
"parser_mode"="fine_grained"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华" }, { 
"token": "清华大学" }, { "token": "华大" }, { "token": "大学" }]
+```
+
+#### 示例 3: 使用字符过滤器
+
+```sql
+SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd',
+    '"built_in_analyzer"="unicode","char_filter_type" = 
"char_replace","char_filter_pattern" = "._=:,","char_filter_replacement" = " 
"');
+```
+```
+[{ "token": "get" }, { "token": "images" }, { "token": "hm" }, { "token": "bg" 
}, { "token": "jpg" }, { "token": "http" }, { "token": "1" }, { "token": "0" }, 
{ "token": "test" }, { "token": "abc" }, { "token": "bcd" }]
+```
+
+#### 示例 4: 启用短语支持(位置信息)
+
+```sql
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", 
"support_phrase"="true"');
+```
+```
+[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
+```
+
+#### 示例 5: 使用自定义分词器
+
+首先创建一个自定义分词器:
+
+```sql
+CREATE INVERTED INDEX ANALYZER lowercase_delimited
+PROPERTIES (
+    "tokenizer" = "standard",
+    "token_filter" = "asciifolding, lowercase"
+);
+```
+
+然后在 `TOKENIZE` 中使用:
+
+```sql
+SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
+```
+```
+[{ "token": "foo" }, { "token": "bar" }]
+```
+
+---
+
+## 3.0 版本特性
+
+:::info
+3.0 版本的功能相比 3.1+ 版本有所限制,建议升级到 3.1+ 以获得增强功能。
+:::
+
+### 语法
+
+```sql
+ARRAY<VARCHAR> TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+### 参数
+
+3.0 版本的 `properties` 参数支持:
+
+| 属性 | 描述 | 示例值 |
+|------|------|--------|
+| `parser` | 内置分词器类型 | `"chinese"`, `"english"`, `"unicode"` |
+| `parser_mode` | 中文分词器的分词模式 | `"fine_grained"`, `"coarse_grained"` |
+| `char_filter_type` | 字符过滤器类型 | `"char_replace"` |
+| `char_filter_pattern` | 要替换的字符 | `"._=:,"` |
+| `char_filter_replacement` | 替换字符 | `" "` (空格) |
+| `stopwords` | 停用词配置 | `"none"` |
+
+**3.0 版本不支持:**
+- `built_in_analyzer` 参数
+- `analyzer` 参数(自定义分词器)
+- `support_phrase` 参数
+- `lower_case` 参数
+- 额外的分词器: `icu`, `basic`, `ik`, `standard`
+
+### 返回值
+
+返回 `ARRAY<VARCHAR>` 类型,包含分词后的字符串数组(简单字符串数组,不是 JSON 对象)。
+
+### 示例
+
+#### 示例 1: 使用中文分词器
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese'");
+```
+```
+["我", "来到", "北京", "清华大学"]
+```
+
+#### 示例 2: 中文分词器的细粒度模式
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese', 
'parser_mode'='fine_grained'");
+```
+```
+["我", "来到", "北京", "清华", "清华大学", "华大", "大学"]
+```
+
+#### 示例 3: 使用 Unicode 分词器
+
+```sql
+SELECT TOKENIZE('Apache Doris数据库', "'parser'='unicode'");
+```
+```
+["apache", "doris", "数", "据", "库"]
+```
+
+#### 示例 4: 使用字符过滤器
+
+```sql
+SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd',
+    '"parser"="unicode","char_filter_type" = 
"char_replace","char_filter_pattern" = "._=:,","char_filter_replacement" = " 
"');
+```
+```
+["get", "images", "hm", "bg", "jpg", "http", "1", "0", "test", "abc", "bcd"]
+```
+
+#### 示例 5: 停用词配置
+
+```sql
+SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode"');
+```
+```
+["华", "夏", "智", "胜", "新", "税", "股", "票"]
+```
+
+```sql
+SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode","stopwords" = "none"');
+```
+```
+["华", "夏", "智", "胜", "新", "税", "股", "票", "a"]
+```
+
+---
+
+## 注意事项
+
+1. **版本兼容性**:
+   - 3.0 版本使用 `parser` 参数,返回简单字符串数组
+   - 3.1+ 版本同时支持 `parser`(向后兼容) 和 `built_in_analyzer`,返回 JSON 对象数组
+   - 3.1+ 版本新增自定义分词器、更多内置分词器和短语支持功能
+
+2. **支持的分词器**:
+   - **3.0 版本**: `chinese`, `english`, `unicode`
+   - **3.1+ 版本**: `standard`, `english`, `chinese`, `unicode`, `icu`, `basic`, 
`ik`, `none`
+
+3. **分词模式**: `parser_mode` 属性主要用于 `chinese` 分词器:
+   - `fine_grained`: 细粒度模式,生成更详细的词条,包含重叠片段
+   - `coarse_grained`: 粗粒度模式(默认),标准分词
+
+4. **字符过滤器**: 需要同时使用 `char_filter_type`、`char_filter_pattern` 和 
`char_filter_replacement` 来在分词前替换特定字符。
+
+5. **性能考虑**: `TOKENIZE` 函数主要用于测试和调试分词器配置。在生产环境的全文搜索中,应使用带有 `MATCH` 谓词的倒排索引。
+
+6. **与倒排索引的兼容性**: 在 `TOKENIZE` 中使用的相同分词器配置可以应用于创建表时的倒排索引:
+   ```sql
+   CREATE TABLE example (
+       content TEXT,
+       INDEX idx_content(content) USING INVERTED PROPERTIES("parser"="chinese")
+   )
+   ```
+
+7. **测试分词器行为**: 使用 `TOKENIZE` 可以在创建倒排索引之前预览文本的分词效果,有助于为您的数据选择最合适的分词器。
+
+## 关键字
+
+TOKENIZE, STRING, 全文搜索, 倒排索引, 分词器
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
index 61e2285f0f6..a996a7890e6 100644
--- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
+++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
@@ -5,3 +5,165 @@
 }
 ---
 
+## 描述
+
+`TOKENIZE` 
函数使用指定的分词器对字符串进行分词,并以JSON格式的字符串数组返回分词结果。该函数特别适用于理解在使用倒排索引进行全文搜索时,文本将如何被分析处理。
+
+## 语法
+
+```sql
+VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+## 参数
+
+- `str`: 要进行分词的输入字符串,类型: `VARCHAR`
+- `properties`: 指定分词器配置的属性字符串,类型: `VARCHAR`
+
+`properties` 参数支持以下键值对(格式: `"key1"="value1", "key2"="value2"`):
+
+### 常用属性
+
+| 属性 | 描述 | 示例值 |
+|------|------|--------|
+| `built_in_analyzer` | 内置分词器类型 | `"english"`, `"chinese"`, `"unicode"`, `"icu"`, `"basic"`, `"ik"`, `"standard"`, `"none"` |
+| `analyzer` | 自定义分词器名称(通过 `CREATE INVERTED INDEX ANALYZER` 创建) | `"my_custom_analyzer"` |
+| `parser_mode` | 分词器模式(用于中文分词器) | `"fine_grained"`, `"coarse_grained"` |
+| `support_phrase` | 启用短语支持(存储位置信息) | `"true"`, `"false"` |
+| `lower_case` | 将词条转换为小写 | `"true"`, `"false"` |
+| `char_filter_type` | 字符过滤器类型 | 根据过滤器而异 |
+| `stop_words` | 停用词配置 | 根据实现而异 |
+
+## 返回值
+
+返回包含分词结果JSON数组的 `VARCHAR` 类型字符串。数组中的每个元素是一个对象,具有以下结构:
+
+- `token`: 分词后的词条
+- `position`: (可选)当启用 `support_phrase` 时,词条的位置索引
+
+## 示例
+
+### 示例 1: 使用内置分词器
+
+```sql
+-- 使用标准分词器
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }]
+```
+
+```sql
+-- 使用英语分词器
+SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
+```
+```
+[{ "token": "run" }, { "token": "quick" }]
+```
+
+```sql
+-- 使用unicode分词器处理中文文本
+SELECT TOKENIZE("Apache Doris数据库", '"built_in_analyzer"="unicode"');
+```
+```
+[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" 
}, { "token": "库" }]
+```
+
+```sql
+-- 使用中文分词器
+SELECT TOKENIZE("我来到北京清华大学", '"built_in_analyzer"="chinese"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
+```
+
+```sql
+-- 使用ICU分词器处理多语言文本
+SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }, {"token": "世界"}]
+```
+
+```sql
+-- 使用基础分词器
+SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", 
'"built_in_analyzer"="basic"');
+```
+```
+[{ "token": "get" }, { "token": "images" }, {"token": "hm"}, {"token": "bg"}, 
{"token": "jpg"}, {"token": "http"}, {"token": "1"}, {"token": "0"}]
+```
+
+```sql
+-- 使用IK分词器处理中文文本
+SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
+```
+```
+[{ "token": "中华人民共和国" }, { "token": "国歌" }]
+```
+
+### 示例 2: 使用自定义分词器
+
+首先创建一个自定义分词器:
+
+```sql
+CREATE INVERTED INDEX ANALYZER lowercase_delimited
+PROPERTIES (
+    "tokenizer" = "standard",
+    "token_filter" = "asciifolding, lowercase"
+);
+```
+
+然后在 `TOKENIZE` 中使用:
+
+```sql
+SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
+```
+```
+[{ "token": "foo" }, { "token": "bar" }]
+```
+
+### 示例 3: 启用短语支持(位置信息)
+
+```sql
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", 
"support_phrase"="true"');
+```
+```
+[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
+```
+
+## 注意事项
+
+1. **分词器配置**: `properties` 参数必须是有效的属性字符串。如果使用自定义分词器,必须先使用 `CREATE INVERTED 
INDEX ANALYZER` 创建。
+
+2. **支持的分词器**: 当前支持的内置分词器包括:
+   - `standard`: 标准分词器,用于通用文本
+   - `english`: 带词干提取的英语分词器
+   - `chinese`: 中文文本分词器
+   - `unicode`: 基于Unicode的多语言文本分词器
+   - `icu`: 基于ICU的高级Unicode处理分词器
+   - `basic`: 基础分词
+   - `ik`: 中文IK分词器
+   - `none`: 不分词(返回原始字符串作为单个词条)
+
+3. **性能考虑**: `TOKENIZE` 函数主要用于测试和调试分词器配置。在生产环境的全文搜索中,应使用带有 `MATCH` 或 `SEARCH` 
操作符的倒排索引。
+
+4. **JSON输出**: 输出是格式化的JSON字符串,如需进一步处理,可以使用JSON函数。
+
+5. **与倒排索引的兼容性**: 在 `TOKENIZE` 中使用的相同分词器配置可以应用于创建表时的倒排索引:
+   ```sql
+   CREATE TABLE example (
+       content TEXT,
+       INDEX idx_content(content) USING INVERTED 
PROPERTIES("analyzer"="my_analyzer")
+   )
+   ```
+
+6. **测试分词器行为**: 使用 `TOKENIZE` 可以在创建倒排索引之前预览文本的分词效果,有助于为您的数据选择最合适的分词器。
+
+## 相关函数
+
+- 
[MATCH](../../../../sql-manual/basic-element/operators/conditional-operators/full-text-search-operators):
 使用倒排索引进行全文搜索
+- [SEARCH](../../../../ai/text-search/search-function): 支持DSL的高级搜索
+
+## 关键字
+
+TOKENIZE, STRING, 全文搜索, 倒排索引, 分词器
diff --git 
a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
 
b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
index 36a148b499a..09bd7017122 100644
--- 
a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
+++ 
b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
@@ -5,3 +5,120 @@
 }
 ---
 
+## Description
+
+The `TOKENIZE` function tokenizes a string using a specified parser and 
returns the tokenization results as a string array. This function is 
particularly useful for testing and understanding how text will be analyzed 
when using inverted indexes with full-text search capabilities.
+
+## Syntax
+
+```sql
+ARRAY<VARCHAR> TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+## Parameters
+
+- `str`: The input string to be tokenized. Type: `VARCHAR`
+- `properties`: A property string specifying the parser configuration. Type: 
`VARCHAR`
+
+The `properties` parameter supports the following key-value pairs (format: 
`'key1'='value1', 'key2'='value2'` or `"key1"="value1", "key2"="value2"`):
+
+### Supported Properties
+
+| Property | Description | Example Values |
+|----------|-------------|----------------|
+| `parser` | Built-in parser type | `"chinese"`, `"english"`, `"unicode"` |
+| `parser_mode` | Parser mode for Chinese parser | `"fine_grained"`, `"coarse_grained"` |
+| `char_filter_type` | Character filter type | `"char_replace"` |
+| `char_filter_pattern` | Characters to be replaced (used with `char_filter_type`) | `"._=:,"` |
+| `char_filter_replacement` | Replacement character (used with `char_filter_type`) | `" "` (space) |
+| `stopwords` | Stop words configuration | `"none"` |
+
+## Return Value
+
+Returns an `ARRAY<VARCHAR>` containing the tokenized strings as individual 
array elements.
+
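+Because the result is an ordinary `ARRAY<VARCHAR>`, it can be passed straight to array functions. A minimal sketch, assuming `array_contains` is available in your 2.1 build:
+
+```sql
+-- Check whether the tokenized text contains a given term
+SELECT array_contains(TOKENIZE('我来到北京清华大学', "'parser'='chinese'"), '北京');
+```
+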
+## Examples
+
+### Example 1: Using the Chinese parser
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese'");
+```
+```
+["我", "来到", "北京", "清华大学"]
+```
+
+### Example 2: Chinese parser with fine-grained mode
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese', 
'parser_mode'='fine_grained'");
+```
+```
+["我", "来到", "北京", "清华", "清华大学", "华大", "大学"]
+```
+
+### Example 3: Using the Unicode parser
+
+```sql
+SELECT TOKENIZE('Apache Doris数据库', "'parser'='unicode'");
+```
+```
+["apache", "doris", "数", "据", "库"]
+```
+
+### Example 4: Using character filters
+
+```sql
+SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd',
+    '"parser"="unicode","char_filter_type" = 
"char_replace","char_filter_pattern" = "._=:,","char_filter_replacement" = " 
"');
+```
+```
+["get", "images", "hm", "bg", "jpg", "http", "1", "0", "test", "abc", "bcd"]
+```
+
+### Example 5: Stopwords configuration
+
+```sql
+SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode"');
+```
+```
+["华", "夏", "智", "胜", "新", "税", "股", "票"]
+```
+
+```sql
+SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode","stopwords" = "none"');
+```
+```
+["华", "夏", "智", "胜", "新", "税", "股", "票", "a"]
+```
+
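+### Example 6: Using the English parser
+
+The `english` parser from the table above is not covered by the examples; a minimal sketch (the output shown is indicative and may vary with the stemmer in your build):
+
+```sql
+SELECT TOKENIZE('Hello World', "'parser'='english'");
+```
+```
+["hello", "world"]
+```
+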
+## Notes
+
+1. **Parser Configuration**: The `properties` parameter must be a valid 
property string. Only built-in parsers are supported in this version.
+
+2. **Supported Parsers**: Version 2.1 supports the following built-in parsers:
+   - `chinese`: Chinese text parser with optional `parser_mode` 
(`fine_grained` or `coarse_grained`)
+   - `english`: English language parser with stemming
+   - `unicode`: Unicode-based parser for multilingual text
+
+3. **Parser Mode**: The `parser_mode` property is primarily used with the 
`chinese` parser:
+   - `fine_grained`: Produces more detailed tokens with overlapping segments
+   - `coarse_grained`: Default mode with standard segmentation
+
+4. **Character Filters**: Use `char_filter_type`, `char_filter_pattern`, and 
`char_filter_replacement` together to replace specific characters before 
tokenization.
+
+5. **Performance**: The `TOKENIZE` function is primarily intended for testing 
and debugging parser configurations. For production full-text search, use 
inverted indexes with the `MATCH` predicate.
+
+6. **Compatibility with Inverted Indexes**: The same parser configuration used 
in `TOKENIZE` can be applied to inverted indexes when creating tables:
+   ```sql
+   CREATE TABLE example (
+       content TEXT,
+       INDEX idx_content(content) USING INVERTED PROPERTIES("parser"="chinese")
+   )
+   ```
+
+7. **Testing Parser Behavior**: Use `TOKENIZE` to preview how text will be 
tokenized before creating inverted indexes, helping to choose the most 
appropriate parser for your data.
+
+## Keywords
+
+TOKENIZE, STRING, FULL-TEXT SEARCH, INVERTED INDEX, PARSER
diff --git 
a/versioned_docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
 
b/versioned_docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
index 36a148b499a..bb05b517168 100644
--- 
a/versioned_docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
+++ 
b/versioned_docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
@@ -5,3 +5,287 @@
 }
 ---
 
+## Description
+
+The `TOKENIZE` function tokenizes a string using a specified parser/analyzer 
and returns the tokenization results. This function is particularly useful for 
testing and understanding how text will be analyzed when using inverted indexes 
with full-text search capabilities.
+
+:::tip Version Differences
+The behavior of `TOKENIZE` differs between version 3.0 and 3.1+:
+- **Version 3.0**: Uses `parser` parameter, returns simple string array
+- **Version 3.1+**: Supports `built_in_analyzer` and custom `analyzer`, 
returns JSON object array with enhanced features
+
+See the [Version 3.0 Specific Features](#version-30-specific-features) section 
for details on version 3.0 usage.
+:::
+
+---
+
+## Version 3.1+ Features (Recommended)
+
+### Syntax
+
+```sql
+VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+### Parameters
+
+- `str`: The input string to be tokenized. Type: `VARCHAR`
+- `properties`: A property string specifying the analyzer configuration. Type: 
`VARCHAR`
+
+The `properties` parameter supports the following key-value pairs (format: 
`"key1"="value1", "key2"="value2"`):
+
+| Property | Description | Example Values |
+|----------|-------------|----------------|
+| `built_in_analyzer` | Built-in analyzer type | `"standard"`, `"english"`, `"chinese"`, `"unicode"`, `"icu"`, `"basic"`, `"ik"`, `"none"` |
+| `analyzer` | Custom analyzer name (created via `CREATE INVERTED INDEX ANALYZER`) | `"my_custom_analyzer"` |
+| `parser` | Built-in parser type (backward compatible) | `"chinese"`, `"english"`, `"unicode"` |
+| `parser_mode` | Parser mode for Chinese parser | `"fine_grained"`, `"coarse_grained"` |
+| `support_phrase` | Enable phrase support (stores position information) | `"true"`, `"false"` |
+| `lower_case` | Convert tokens to lowercase | `"true"`, `"false"` |
+| `char_filter_type` | Character filter type | `"char_replace"` |
+| `char_filter_pattern` | Characters to be replaced (used with `char_filter_type`) | `"._=:,"` |
+| `char_filter_replacement` | Replacement character (used with `char_filter_type`) | `" "` (space) |
+| `stopwords` | Stop words configuration | `"none"` |
+
+### Return Value
+
+Returns a `VARCHAR` containing a JSON array of tokenization results. Each 
element in the array is an object with the following structure:
+- `token`: The tokenized term
+- `position`: (Optional) The position index of the token when `support_phrase` 
is enabled
+
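+Since the result is a JSON string, it can be post-processed with JSON functions. A minimal sketch, assuming `json_extract` and a MySQL-style path expression are available in your version:
+
+```sql
+-- Pull the first token out of the returned JSON array
+-- (the '$[0].token' path form is an assumption; adjust to your version's path syntax)
+SELECT json_extract(
+    TOKENIZE('Hello World', '"built_in_analyzer"="standard"'),
+    '$[0].token'
+) AS first_token;
+```
+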
+### Examples
+
+#### Example 1: Using built-in analyzers
+
+```sql
+-- Standard analyzer
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }]
+```
+
+```sql
+-- English analyzer with stemming
+SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
+```
+```
+[{ "token": "run" }, { "token": "quick" }]
+```
+
+```sql
+-- Chinese analyzer
+SELECT TOKENIZE('我来到北京清华大学', '"built_in_analyzer"="chinese"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
+```
+
+```sql
+-- Unicode analyzer
+SELECT TOKENIZE('Apache Doris数据库', '"built_in_analyzer"="unicode"');
+```
+```
+[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" 
}, { "token": "库" }]
+```
+
+```sql
+-- ICU analyzer for multilingual text
+SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }, { "token": "世界" }]
+```
+
+```sql
+-- Basic analyzer
+SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", 
'"built_in_analyzer"="basic"');
+```
+```
+[{ "token": "get" }, { "token": "images" }, { "token": "hm" }, { "token": "bg" 
}, { "token": "jpg" }, { "token": "http" }, { "token": "1" }, { "token": "0" }]
+```
+
+```sql
+-- IK analyzer for Chinese text
+SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
+```
+```
+[{ "token": "中华人民共和国" }, { "token": "国歌" }]
+```
+
+#### Example 2: Chinese parser with fine-grained mode
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', '"built_in_analyzer"="chinese", 
"parser_mode"="fine_grained"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华" }, { 
"token": "清华大学" }, { "token": "华大" }, { "token": "大学" }]
+```
+
+#### Example 3: Using character filters
+
+```sql
+SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd',
+    '"built_in_analyzer"="unicode","char_filter_type" = 
"char_replace","char_filter_pattern" = "._=:,","char_filter_replacement" = " 
"');
+```
+```
+[{ "token": "get" }, { "token": "images" }, { "token": "hm" }, { "token": "bg" 
}, { "token": "jpg" }, { "token": "http" }, { "token": "1" }, { "token": "0" }, 
{ "token": "test" }, { "token": "abc" }, { "token": "bcd" }]
+```
+
+#### Example 4: With phrase support (position information)
+
+```sql
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", 
"support_phrase"="true"');
+```
+```
+[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
+```
+
+#### Example 5: Using custom analyzers
+
+First, create a custom analyzer:
+
+```sql
+CREATE INVERTED INDEX ANALYZER lowercase_delimited
+PROPERTIES (
+    "tokenizer" = "standard",
+    "token_filter" = "asciifolding, lowercase"
+);
+```
+
+Then use it with `TOKENIZE`:
+
+```sql
+SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
+```
+```
+[{ "token": "foo" }, { "token": "bar" }]
+```
+
+---
+
+## Version 3.0 Specific Features
+
+:::info
+Version 3.0 has limited functionality compared to 3.1+. It's recommended to 
upgrade to 3.1+ for enhanced features.
+:::
+
+### Syntax
+
+```sql
+ARRAY<VARCHAR> TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+### Parameters
+
+The `properties` parameter in version 3.0 supports:
+
+| Property | Description | Example Values |
+|----------|-------------|----------------|
+| `parser` | Built-in parser type | `"chinese"`, `"english"`, `"unicode"` |
+| `parser_mode` | Parser mode for Chinese parser | `"fine_grained"`, `"coarse_grained"` |
+| `char_filter_type` | Character filter type | `"char_replace"` |
+| `char_filter_pattern` | Characters to be replaced | `"._=:,"` |
+| `char_filter_replacement` | Replacement character | `" "` (space) |
+| `stopwords` | Stop words configuration | `"none"` |
+
+**Not supported in version 3.0:**
+- `built_in_analyzer` parameter
+- `analyzer` parameter (custom analyzers)
+- `support_phrase` parameter
+- `lower_case` parameter
+- Additional analyzers: `icu`, `basic`, `ik`, `standard`
+
+### Return Value
+
+Returns an `ARRAY<VARCHAR>` containing the tokenized strings as individual 
array elements (simple string array, not JSON objects).
+
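+Because the 3.0 result is an ordinary `ARRAY<VARCHAR>`, it can be passed straight to array functions. A minimal sketch, assuming `array_contains` is available:
+
+```sql
+-- Check whether the tokenized text contains a given term (3.0 syntax)
+SELECT array_contains(TOKENIZE('我来到北京清华大学', "'parser'='chinese'"), '北京');
+```
+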
+### Examples
+
+#### Example 1: Using the Chinese parser
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese'");
+```
+```
+["我", "来到", "北京", "清华大学"]
+```
+
+#### Example 2: Chinese parser with fine-grained mode
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', "'parser'='chinese', 
'parser_mode'='fine_grained'");
+```
+```
+["我", "来到", "北京", "清华", "清华大学", "华大", "大学"]
+```
+
+#### Example 3: Using the Unicode parser
+
+```sql
+SELECT TOKENIZE('Apache Doris数据库', "'parser'='unicode'");
+```
+```
+["apache", "doris", "数", "据", "库"]
+```
+
+#### Example 4: Using character filters
+
+```sql
+SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0 test:abc=bcd',
+    '"parser"="unicode","char_filter_type" = 
"char_replace","char_filter_pattern" = "._=:,","char_filter_replacement" = " 
"');
+```
+```
+["get", "images", "hm", "bg", "jpg", "http", "1", "0", "test", "abc", "bcd"]
+```
+
+#### Example 5: Stopwords configuration
+
+```sql
+SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode"');
+```
+```
+["华", "夏", "智", "胜", "新", "税", "股", "票"]
+```
+
+```sql
+SELECT TOKENIZE('华夏智胜新税股票A', '"parser"="unicode","stopwords" = "none"');
+```
+```
+["华", "夏", "智", "胜", "新", "税", "股", "票", "a"]
+```
+
+---
+
+## Notes
+
+1. **Version Compatibility**:
+   - Version 3.0 uses `parser` parameter and returns simple string arrays
+   - Version 3.1+ supports both `parser` (backward compatible) and 
`built_in_analyzer`, returns JSON object arrays
+   - Version 3.1+ adds custom analyzers, additional built-in analyzers, and 
phrase support
+
+2. **Supported Analyzers**:
+   - **Version 3.0**: `chinese`, `english`, `unicode`
+   - **Version 3.1+**: `standard`, `english`, `chinese`, `unicode`, `icu`, 
`basic`, `ik`, `none`
+
+3. **Parser Mode**: The `parser_mode` property is primarily used with the 
`chinese` parser:
+   - `fine_grained`: Produces more detailed tokens with overlapping segments
+   - `coarse_grained`: Default mode with standard segmentation
+
+4. **Character Filters**: Use `char_filter_type`, `char_filter_pattern`, and 
`char_filter_replacement` together to replace specific characters before 
tokenization.
+
+5. **Performance**: The `TOKENIZE` function is primarily intended for testing and debugging parser configurations. For production full-text search, use inverted indexes with the `MATCH` predicate (a query sketch follows these notes).
+
+6. **Compatibility with Inverted Indexes**: The same parser/analyzer 
configuration can be applied to inverted indexes:
+   ```sql
+   CREATE TABLE example (
+       content TEXT,
+       INDEX idx_content(content) USING INVERTED PROPERTIES("parser"="chinese")
+   )
+   ```
+
+7. **Testing Parser Behavior**: Use `TOKENIZE` to preview how text will be 
tokenized before creating inverted indexes.
+
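+Building on notes 5 and 6, a minimal query sketch against the `example` table defined above; `MATCH_ANY` is one of the match predicates that run on inverted-indexed columns:
+
+```sql
+-- Full-text lookup that uses the inverted index rather than TOKENIZE
+SELECT * FROM example WHERE content MATCH_ANY '清华大学';
+```
+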
+## Keywords
+
+TOKENIZE, STRING, FULL-TEXT SEARCH, INVERTED INDEX, PARSER, ANALYZER
diff --git 
a/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
 
b/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
index 36a148b499a..7dda53e8101 100644
--- 
a/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
+++ 
b/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md
@@ -5,3 +5,165 @@
 }
 ---
 
+## Description
+
+The `TOKENIZE` function tokenizes a string with the specified analyzer and returns the tokenization result as a JSON array of token objects, serialized as a string. This function is particularly useful for understanding how text will be analyzed when using inverted indexes with full-text search capabilities.
+
+## Syntax
+
+```sql
+VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
+```
+
+## Parameters
+
+- `str`: The input string to be tokenized. Type: `VARCHAR`
+- `properties`: A property string specifying the analyzer configuration. Type: 
`VARCHAR`
+
+The `properties` parameter supports the following key-value pairs (format: 
`"key1"="value1", "key2"="value2"`):
+
+### Common Properties
+
+| Property | Description | Example Values |
+|----------|-------------|----------------|
+| `built_in_analyzer` | Built-in analyzer type | `"english"`, `"chinese"`, `"unicode"`, `"icu"`, `"basic"`, `"ik"`, `"standard"`, `"none"` |
+| `analyzer` | Custom analyzer name (created via `CREATE INVERTED INDEX ANALYZER`) | `"my_custom_analyzer"` |
+| `parser_mode` | Parser mode (for chinese analyzers) | `"fine_grained"`, `"coarse_grained"` |
+| `support_phrase` | Enable phrase support (stores position information) | `"true"`, `"false"` |
+| `lower_case` | Convert tokens to lowercase | `"true"`, `"false"` |
+| `char_filter_type` | Character filter type | Varies by filter |
+| `stop_words` | Stop words configuration | Varies by implementation |
+
+## Return Value
+
+Returns a `VARCHAR` containing a JSON array of tokenization results. Each 
element in the array is an object with the following structure:
+
+- `token`: The tokenized term
+- `position`: (Optional) The position index of the token when `support_phrase` 
is enabled
+
+## Examples
+
+### Example 1: Using built-in analyzers
+
+```sql
+-- Using the standard analyzer
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }]
+```
+
+```sql
+-- Using the english analyzer
+SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
+```
+```
+[{ "token": "run" }, { "token": "quick" }]
+```
+
+```sql
+-- Using the unicode analyzer with Chinese text
+SELECT TOKENIZE("Apache Doris数据库", '"built_in_analyzer"="unicode"');
+```
+```
+[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" 
}, { "token": "库" }]
+```
+
+```sql
+-- Using the chinese analyzer
+SELECT TOKENIZE("我来到北京清华大学", '"built_in_analyzer"="chinese"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
+```
+
+```sql
+-- Using the icu analyzer for multilingual text
+SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
+```
+```
+[{ "token": "hello" }, { "token": "world" }, {"token": "世界"}]
+```
+
+```sql
+-- Using the basic analyzer
+SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", 
'"built_in_analyzer"="basic"');
+```
+```
+[{ "token": "get" }, { "token": "images" }, {"token": "hm"}, {"token": "bg"}, 
{"token": "jpg"}, {"token": "http"}, {"token": "1"}, {"token": "0"}]
+```
+
+```sql
+-- Using the ik analyzer for Chinese text
+SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
+```
+```
+[{ "token": "中华人民共和国" }, { "token": "国歌" }]
+```
+
+### Example 2: Using custom analyzers
+
+First, create a custom analyzer:
+
+```sql
+CREATE INVERTED INDEX ANALYZER lowercase_delimited
+PROPERTIES (
+    "tokenizer" = "standard",
+    "token_filter" = "asciifolding, lowercase"
+);
+```
+
+Then use it with `TOKENIZE`:
+
+```sql
+SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
+```
+```
+[{ "token": "foo" }, { "token": "bar" }]
+```
+
+### Example 3: With phrase support (position information)
+
+```sql
+SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", 
"support_phrase"="true"');
+```
+```
+[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
+```
+
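+### Example 4: Fine-grained mode for the chinese analyzer
+
+The `parser_mode` property from the table above can be combined with the `chinese` analyzer to switch to fine-grained segmentation. A minimal sketch; the exact tokens depend on the dictionary shipped with your build:
+
+```sql
+SELECT TOKENIZE('我来到北京清华大学', '"built_in_analyzer"="chinese", "parser_mode"="fine_grained"');
+```
+```
+[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华" }, { "token": "清华大学" }, { "token": "华大" }, { "token": "大学" }]
+```
+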
+## Notes
+
+1. **Analyzer Configuration**: The `properties` parameter must be a valid 
property string. If using a custom analyzer, it must be created beforehand 
using `CREATE INVERTED INDEX ANALYZER`.
+
+2. **Supported Analyzers**: Currently supported built-in analyzers include:
+   - `standard`: Standard analyzer for general text
+   - `english`: English language analyzer with stemming
+   - `chinese`: Chinese text analyzer
+   - `unicode`: Unicode-based analyzer for multilingual text
+   - `icu`: ICU-based analyzer for advanced Unicode processing
+   - `basic`: Basic tokenization
+   - `ik`: IK analyzer for Chinese text
+   - `none`: No tokenization (returns original string as single token)
+
+3. **Performance**: The `TOKENIZE` function is primarily intended for testing 
and debugging analyzer configurations. For production full-text search, use 
inverted indexes with the `MATCH` or `SEARCH` operators.
+
+4. **JSON Output**: The output is a formatted JSON string that can be further processed using JSON functions if needed (see the sketch after these notes).
+
+5. **Compatibility with Inverted Indexes**: The same analyzer configuration 
used in `TOKENIZE` can be applied to inverted indexes when creating tables:
+   ```sql
+   CREATE TABLE example (
+       content TEXT,
+       INDEX idx_content(content) USING INVERTED 
PROPERTIES("analyzer"="my_analyzer")
+   )
+   ```
+
+6. **Testing Analyzer Behavior**: Use `TOKENIZE` to preview how text will be 
tokenized before creating inverted indexes, helping to choose the most 
appropriate analyzer for your data.
+
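+As an illustration of note 4, the JSON result can be handed to Doris JSON functions. A minimal sketch, assuming `json_extract` and a MySQL-style path expression are available in your version:
+
+```sql
+-- Pull the first token out of the JSON array returned by TOKENIZE
+-- (the '$[0].token' path form is an assumption; adjust to your version's path syntax)
+SELECT json_extract(
+    TOKENIZE('Hello World', '"built_in_analyzer"="standard"'),
+    '$[0].token'
+) AS first_token;
+```
+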
+## Related Functions
+
+- 
[MATCH](../../../../sql-manual/basic-element/operators/conditional-operators/full-text-search-operators):
 Full-text search using inverted indexes
+- [SEARCH](../../../../ai/text-search/search-function): Advanced search with 
DSL support
+
+## Keywords
+
+TOKENIZE, STRING, FULL-TEXT SEARCH, INVERTED INDEX, ANALYZER


