This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new fa1161f51c0 [update](inverted index) update 3.x inverted index docs for version 3.1 (#2987)
fa1161f51c0 is described below
commit fa1161f51c00f11b45245134cc7d495002237525
Author: Jack <[email protected]>
AuthorDate: Mon Oct 20 21:11:15 2025 +0800
[update](inverted index) update 3.x inverted index docs for version 3.1 (#2987)
## Versions
- [ ] dev
- [x] 3.x
- [ ] 2.1
- [ ] 2.0
## Languages
- [x] Chinese
- [x] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---
.../table-design/index/inverted-index.md | 83 +++++++++++++++++++++
.../table-design/index/inverted-index.md | 84 ++++++++++++++++++++++
2 files changed, 167 insertions(+)
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/table-design/index/inverted-index.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/table-design/index/inverted-index.md
index 3bf3029c4c4..aa7363ca275 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/table-design/index/inverted-index.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/table-design/index/inverted-index.md
@@ -99,6 +99,9 @@ table_properties;
<p>- `english`: English tokenization, suitable for columns with English text, uses spaces and punctuation for tokenization, high performance</p>
<p>- `chinese`: Chinese tokenization, suitable for columns with mainly Chinese text, lower performance than English tokenization</p>
<p>- `unicode`: Multilingual mixed tokenization, suitable for mixed Chinese/English and multilingual text. It can tokenize email prefixes and suffixes, IP addresses, and mixed alphanumeric strings, and can tokenize Chinese character by character.</p>
+ <p>- `icu` (Supported since 3.1.0): ICU (International Components for Unicode) tokenization, based on the ICU library. Suitable for internationalized text, complex writing systems, and multilingual documents. Supports Arabic, Thai, and many other Unicode scripts.</p>
+ <p>- `basic` (Supported since 3.1.0): Basic rule tokenizer using simple character type recognition. Suitable for scenarios with extremely high performance requirements or simple text processing needs. Rules: consecutive alphanumeric characters form one token, each Chinese character is a separate token, and punctuation/spaces/special characters are ignored. It has the highest performance of all tokenizers, but its tokenization logic is simpler than unicode or icu.</p>
+ <p>- `ik` (Supported since 3.1.0): IK Chinese tokenization, designed specifically for Chinese text analysis.</p>
Tokenization results can be verified with the `TOKENIZE` SQL function; see the following sections for details.
</details>
@@ -170,8 +173,44 @@ table_properties;
<p>- none: Use an empty stopword list</p>
</details>
+<details>
+  <summary>dict_compression (Supported since 3.1.0)</summary>
+
+  **Specifies whether to enable ZSTD dictionary compression for the inverted index term dictionary**
+  <p>- true: Enable dictionary compression, which can reduce index storage size by up to 20%, especially useful for large-scale text data and log analysis scenarios</p>
+  <p>- false: Disable dictionary compression (default)</p>
+  <p>- Recommendation: Enable for large text datasets, log analysis, or storage-cost-sensitive scenarios. Works best together with inverted_index_storage_format = "V3"</p>
+
+  For example:
+```sql
+  INDEX idx_name(column_name) USING INVERTED PROPERTIES("parser" = "english", "dict_compression" = "true")
+```
+</details>
+
**4. `COMMENT` is optional for specifying index comments**
+**5. Table-level property `inverted_index_storage_format` (Supported since 3.1.0)**
+
+  To use the new V3 storage format, specify this property when creating the table:
+
+```sql
+CREATE TABLE table_name (
+  column_name TEXT,
+  INDEX idx_name(column_name) USING INVERTED PROPERTIES("parser" = "english", "dict_compression" = "true")
+) PROPERTIES (
+  "inverted_index_storage_format" = "V3"
+);
+```
+
+  **inverted_index_storage_format values:**
+  <p>- "V2": Default storage format</p>
+  <p>- "V3": New storage format with optimized compression. Compared to V2, V3 provides:</p>
+  <p>  - Smaller index files, reducing disk usage and I/O overhead</p>
+  <p>  - Up to 20% storage space savings for large-scale text data and log analysis scenarios</p>
+  <p>  - ZSTD dictionary compression for term dictionaries (when dict_compression is enabled)</p>
+  <p>  - Compression for positional information associated with each term</p>
+  <p>- Recommendation: Use V3 for new tables with large text datasets, log analysis workloads, or when storage optimization is important</p>
+
### Adding Inverted Indexes to Existing Tables
@@ -378,6 +417,50 @@ SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"');
+-------------------------------------------------------------------+
| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] |
+-------------------------------------------------------------------+
+
+-- ICU tokenization for multilingual text (Supported since 3.1.0)
+SELECT TOKENIZE('مرحبا بالعالم Hello 世界', '"parser"="icu"');
++--------------------------------------------------------+
+| tokenize('مرحبا بالعالم Hello 世界', '"parser"="icu"') |
++--------------------------------------------------------+
+| ["مرحبا", "بالعالم", "Hello", "世界"] |
++--------------------------------------------------------+
+
+SELECT TOKENIZE('มันไม่เป็นไปตามความต้องการ', '"parser"="icu"');
++-------------------------------------------------------------------+
+| tokenize('มันไม่เป็นไปตามความต้องการ', '"parser"="icu"') |
++-------------------------------------------------------------------+
+| ["มัน", "ไม่เป็น", "ไป", "ตาม", "ความ", "ต้องการ"] |
++-------------------------------------------------------------------+
+
+-- Basic tokenization for high-performance scenarios (Supported since 3.1.0)
+SELECT TOKENIZE('Hello World! This is a test.', '"parser"="basic"');
++--------------------------------------------------------------+
+| tokenize('Hello World! This is a test.', '"parser"="basic"') |
++--------------------------------------------------------------+
+| ["hello", "world", "this", "is", "a", "test"]                |
++--------------------------------------------------------------+
+
+SELECT TOKENIZE('你好世界', '"parser"="basic"');
++-------------------------------------------+
+| tokenize('你好世界', '"parser"="basic"') |
++-------------------------------------------+
+| ["你", "好", "世", "界"] |
++-------------------------------------------+
+
+SELECT TOKENIZE('Hello你好World世界', '"parser"="basic"');
++------------------------------------------------------+
+| tokenize('Hello你好World世界', '"parser"="basic"') |
++------------------------------------------------------+
+| ["hello", "你", "好", "world", "世", "界"] |
++------------------------------------------------------+
+
+SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0', '"parser"="basic"');
++---------------------------------------------------------------------+
+| tokenize('GET /images/hm_bg.jpg HTTP/1.0', '"parser"="basic"') |
++---------------------------------------------------------------------+
+| ["get", "images", "hm", "bg", "jpg", "http", "1", "0"] |
++---------------------------------------------------------------------+
```
## 使用示例
diff --git a/versioned_docs/version-3.x/table-design/index/inverted-index.md b/versioned_docs/version-3.x/table-design/index/inverted-index.md
index 1bfe5b06557..d903d9ea0b1 100644
--- a/versioned_docs/version-3.x/table-design/index/inverted-index.md
+++ b/versioned_docs/version-3.x/table-design/index/inverted-index.md
@@ -97,6 +97,9 @@ Syntax explanation:
<p>- `english`: English tokenization, suitable for columns with English text, uses spaces and punctuation for tokenization, high performance</p>
<p>- `chinese`: Chinese tokenization, suitable for columns with mainly Chinese text, lower performance than English tokenization</p>
<p>- `unicode`: Unicode tokenization, suitable for mixed Chinese and English, and mixed multilingual texts. It can tokenize email prefixes and suffixes, IP addresses, and mixed character and number strings, and can tokenize Chinese by characters.</p>
+ <p>- `icu` (Supported since 3.1.0): ICU (International Components for Unicode) tokenization, based on the ICU library. Ideal for internationalized text with complex writing systems and multilingual documents. Supports languages like Arabic, Thai, and other Unicode-based scripts.</p>
+ <p>- `basic` (Supported since 3.1.0): Basic rule-based tokenization using simple character type recognition. Suitable for scenarios with extremely high performance requirements or simple text processing needs. Rules: continuous alphanumeric characters are treated as one token, each Chinese character is a separate token, and punctuation/spaces/special characters are ignored. This tokenizer provides the best performance among all tokenizers but with simpler tokenization logic compared to unicode or icu.</p>
+ <p>- `ik` (Supported since 3.1.0): IK Chinese tokenization, specifically designed for Chinese text analysis.</p>
Tokenization results can be verified using the `TOKENIZE` SQL function; see the following sections for details.
</details>
@@ -168,8 +171,44 @@ Syntax explanation:
<p>- none: Use an empty stopword list</p>
</details>
+<details>
+ <summary>dict_compression (Supported since 3.1.0)</summary>
+
+  **Specifies whether to enable ZSTD dictionary compression for the inverted index term dictionary**
+  <p>- true: Enable dictionary compression, which can reduce index storage size by up to 20%, especially effective for large-scale text data and log analysis scenarios</p>
+  <p>- false: Disable dictionary compression (default)</p>
+  <p>- Recommendation: Enable for scenarios with large text datasets, log analytics, or when storage cost is a concern. Works best with inverted_index_storage_format = "V3"</p>
+
+ For example:
+```sql
+  INDEX idx_name(column_name) USING INVERTED PROPERTIES("parser" = "english", "dict_compression" = "true")
+```
+</details>
+
**4. `COMMENT` is optional for specifying index comments**
+**5. Table-level property `inverted_index_storage_format` (Supported since 3.1.0)**
+
+  To use the new V3 storage format for inverted indexes, specify this property when creating the table:
+
+```sql
+CREATE TABLE table_name (
+  column_name TEXT,
+  INDEX idx_name(column_name) USING INVERTED PROPERTIES("parser" = "english", "dict_compression" = "true")
+) PROPERTIES (
+  "inverted_index_storage_format" = "V3"
+);
+```
+
+  **inverted_index_storage_format values:**
+  <p>- "V2": Default storage format</p>
+  <p>- "V3": New storage format with optimized compression. Compared to V2, V3 provides:</p>
+  <p>  - Smaller index files, reducing disk usage and I/O overhead</p>
+  <p>  - Up to 20% storage space savings for large-scale text data and log analysis scenarios</p>
+  <p>  - ZSTD dictionary compression for term dictionaries (when dict_compression is enabled)</p>
+  <p>  - Compression for positional information associated with each term</p>
+  <p>- Recommendation: Use V3 for new tables with large text datasets, log analytics workloads, or when storage optimization is important</p>
+
### Adding Inverted Indexes to Existing Tables
**1. ADD INDEX**
@@ -339,6 +378,8 @@ To check the actual effect of tokenization or to tokenize a piece of text, you c
The first parameter of the `TOKENIZE` function is the text to be tokenized, and the second parameter specifies the tokenization parameters used when creating the index.
+```sql
+-- English tokenization
SELECT TOKENIZE('I love Doris','"parser"="english"');
+------------------------------------------------+
| tokenize('I love Doris', '"parser"="english"') |
@@ -346,6 +387,49 @@ SELECT TOKENIZE('I love Doris','"parser"="english"');
| ["i", "love", "doris"] |
+------------------------------------------------+
+-- ICU tokenization for multilingual text (Supported since 3.1.0)
+SELECT TOKENIZE('مرحبا بالعالم Hello 世界', '"parser"="icu"');
++--------------------------------------------------------+
+| tokenize('مرحبا بالعالم Hello 世界', '"parser"="icu"') |
++--------------------------------------------------------+
+| ["مرحبا", "بالعالم", "Hello", "世界"] |
++--------------------------------------------------------+
+
+SELECT TOKENIZE('มันไม่เป็นไปตามความต้องการ', '"parser"="icu"');
++-------------------------------------------------------------------+
+| tokenize('มันไม่เป็นไปตามความต้องการ', '"parser"="icu"') |
++-------------------------------------------------------------------+
+| ["มัน", "ไม่เป็น", "ไป", "ตาม", "ความ", "ต้องการ"] |
++-------------------------------------------------------------------+
+
+-- Basic tokenization for high performance (Supported since 3.1.0)
+SELECT TOKENIZE('Hello World! This is a test.', '"parser"="basic"');
++--------------------------------------------------------------+
+| tokenize('Hello World! This is a test.', '"parser"="basic"') |
++--------------------------------------------------------------+
+| ["hello", "world", "this", "is", "a", "test"]                |
++--------------------------------------------------------------+
+
+SELECT TOKENIZE('你好世界', '"parser"="basic"');
++-------------------------------------------+
+| tokenize('你好世界', '"parser"="basic"') |
++-------------------------------------------+
+| ["你", "好", "世", "界"] |
++-------------------------------------------+
+
+SELECT TOKENIZE('Hello你好World世界', '"parser"="basic"');
++------------------------------------------------------+
+| tokenize('Hello你好World世界', '"parser"="basic"') |
++------------------------------------------------------+
+| ["hello", "你", "好", "world", "世", "界"] |
++------------------------------------------------------+
+
+SELECT TOKENIZE('GET /images/hm_bg.jpg HTTP/1.0', '"parser"="basic"');
++---------------------------------------------------------------------+
+| tokenize('GET /images/hm_bg.jpg HTTP/1.0', '"parser"="basic"') |
++---------------------------------------------------------------------+
+| ["get", "images", "hm", "bg", "jpg", "http", "1", "0"] |
++---------------------------------------------------------------------+
```
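The `basic` rules described in this patch (consecutive alphanumeric runs as one token, each Chinese character as its own token, everything else dropped, output lowercased) can be sketched outside the database for intuition. The following is a minimal Python approximation written for this review; `basic_tokenize` is a hypothetical helper, not Doris's actual tokenizer implementation, and it only handles the basic CJK range:

```python
import re

# Approximation of the documented `basic` tokenizer rules (illustrative only):
# - a run of ASCII letters/digits forms one token, lowercased,
# - each CJK character (basic block U+4E00..U+9FFF) is its own token,
# - punctuation, spaces, and other special characters are ignored.
TOKEN_RE = re.compile(r'([A-Za-z0-9]+)|([\u4e00-\u9fff])')

def basic_tokenize(text: str) -> list[str]:
    # findall yields (alnum_run, cjk_char) tuples; exactly one side is non-empty
    return [run.lower() if run else cjk for run, cjk in TOKEN_RE.findall(text)]

print(basic_tokenize('GET /images/hm_bg.jpg HTTP/1.0'))
# → ['get', 'images', 'hm', 'bg', 'jpg', 'http', '1', '0']
```

Running it on the inputs above reproduces the `TOKENIZE(..., '"parser"="basic"')` results shown in the examples, which is a quick way to sanity-check the stated rules.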
## Usage Example
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]