This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new b2577f3c84d [docs] Add multi-analyzer inverted index documentation (#3275)
b2577f3c84d is described below
commit b2577f3c84d13d8cfd8ad0e0363f2b1073328943
Author: Jack <[email protected]>
AuthorDate: Sun Jan 18 22:33:24 2026 +0800
[docs] Add multi-analyzer inverted index documentation (#3275)
Add documentation for the new feature that allows creating multiple
inverted indexes with different analyzers on a single column.
Changes:
- Add USING ANALYZER syntax to search-operators.md (EN/CN)
- Add "Multiple Analyzers on Single Column" section to custom-analyzer.md (EN/CN)
The documentation covers:
- Syntax and supported operators
- Creating multiple indexes on same column
- Querying with specific analyzer
- Adding indexes to existing tables
- Building indexes
- Important notes on analyzer identity and index selection
## Versions
- [ ] dev
- [ ] 4.x
- [ ] 3.x
- [ ] 2.1
## Languages
- [ ] Chinese
- [ ] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
Co-authored-by: Claude Opus 4.5 <[email protected]>
---
docs/ai/text-search/custom-analyzer.md | 125 ++++++++++++++++++++-
docs/ai/text-search/search-operators.md | 39 +++++++
.../current/ai/text-search/custom-analyzer.md | 123 ++++++++++++++++++++
.../current/ai/text-search/search-operators.md | 39 +++++++
4 files changed, 325 insertions(+), 1 deletion(-)
diff --git a/docs/ai/text-search/custom-analyzer.md b/docs/ai/text-search/custom-analyzer.md
index f6b050de611..1c183478240 100644
--- a/docs/ai/text-search/custom-analyzer.md
+++ b/docs/ai/text-search/custom-analyzer.md
@@ -285,4 +285,127 @@ SELECT * FROM stars WHERE name MATCH '刘德华';
SELECT * FROM stars WHERE name MATCH 'liu';
SELECT * FROM stars WHERE name MATCH 'ldh';
SELECT * FROM stars WHERE name MATCH 'zxy';
-```
\ No newline at end of file
+```
+
+## Multiple Analyzers on Single Column
+
+Doris supports creating multiple inverted indexes with different analyzers on a single column. This enables flexible search strategies where the same data can be searched using different tokenization methods.
+
+### Use Cases
+
+- **Multi-language support**: Use different analyzers for different languages on the same text column
+- **Search precision vs. recall**: Use a keyword analyzer for exact match and a standard analyzer for fuzzy search
+- **Autocomplete**: Use an edge_ngram analyzer for prefix matching while keeping a standard analyzer for regular search
+
+### Creating Multiple Indexes
+
+```sql
+-- Create analyzers with different tokenization strategies
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS std_analyzer
+PROPERTIES ("tokenizer" = "standard", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS kw_analyzer
+PROPERTIES ("tokenizer" = "keyword", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_tokenizer
+PROPERTIES (
+ "type" = "edge_ngram",
+ "min_gram" = "1",
+ "max_gram" = "20",
+ "token_chars" = "letter"
+);
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS ngram_analyzer
+PROPERTIES ("tokenizer" = "edge_ngram_tokenizer", "token_filter" = "lowercase");
+
+-- Create table with multiple indexes on same column
+CREATE TABLE articles (
+ id INT,
+ content TEXT,
+ -- Standard analyzer for tokenized search
+ INDEX idx_content_std (content) USING INVERTED
+ PROPERTIES("analyzer" = "std_analyzer", "support_phrase" = "true"),
+ -- Keyword analyzer for exact match
+ INDEX idx_content_kw (content) USING INVERTED
+ PROPERTIES("analyzer" = "kw_analyzer"),
+ -- Edge n-gram analyzer for autocomplete
+ INDEX idx_content_ngram (content) USING INVERTED
+ PROPERTIES("analyzer" = "ngram_analyzer")
+) ENGINE=OLAP
+DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_allocation" = "tag.location.default: 1");
+```
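The edge_ngram tokenizer configured above (min_gram=1, max_gram=20, token_chars=letter) emits every letter-only prefix of each word. This is a minimal Python sketch of that behavior for intuition only; it is not Doris code and the exact internals may differ:

```python
# Illustrative sketch (not Doris internals): an edge_ngram tokenizer with
# min_gram=1, max_gram=20, token_chars=letter emits prefix tokens per word.
def edge_ngram_tokens(text, min_gram=1, max_gram=20):
    tokens = []
    word = ""
    for ch in text + " ":  # trailing sentinel flushes the last word
        if ch.isalpha():
            word += ch
        else:
            # emit prefixes of the completed word, lowercased
            for n in range(min_gram, min(max_gram, len(word)) + 1):
                tokens.append(word[:n].lower())
            word = ""
    return tokens

print(edge_ngram_tokens("Hello"))  # ['h', 'he', 'hel', 'hell', 'hello']
```

Because every prefix is indexed as its own token, a query for `hel` can hit the index directly, which is what makes the autocomplete use case cheap at query time.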
+
+### Querying with Specific Analyzer
+
+Use the `USING ANALYZER` clause to specify which index to use:
+
+```sql
+-- Insert test data
+INSERT INTO articles VALUES
+ (1, 'hello world'),
+ (2, 'hello'),
+ (3, 'world'),
+ (4, 'hello world test');
+
+-- Tokenized search: matches rows containing 'hello' token
+-- Returns: 1, 2, 4
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer ORDER BY id;
+
+-- Exact match: only matches rows with exact 'hello' string
+-- Returns: 2
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer ORDER BY id;
+
+-- Prefix match with edge n-gram
+-- Returns: 1, 2, 4 (all rows starting with 'hel')
+SELECT id FROM articles WHERE content MATCH 'hel' USING ANALYZER ngram_analyzer ORDER BY id;
+```
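The different result sets above come from how each analyzer tokenizes the stored value. A minimal Python sketch of the matching semantics (whitespace tokens for the standard analyzer, the whole value as one token for the keyword analyzer; illustrative only, not Doris code):

```python
# Illustrative sketch (not Doris code): why MATCH 'hello' returns different
# rows under the standard vs. keyword analyzer.
rows = {1: 'hello world', 2: 'hello', 3: 'world', 4: 'hello world test'}

def standard_tokens(text):
    return text.lower().split()  # split into word tokens, lowercased

def keyword_tokens(text):
    return [text.lower()]        # whole value as a single token

def match(tokens_fn, query):
    # a row matches when the query equals one of its index tokens
    return sorted(i for i, t in rows.items() if query in tokens_fn(t))

print(match(standard_tokens, 'hello'))  # [1, 2, 4]
print(match(keyword_tokens, 'hello'))   # [2]
```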
+
+### Adding Indexes to Existing Tables
+
+```sql
+-- Add a new index with different analyzer
+ALTER TABLE articles ADD INDEX idx_content_chinese (content)
+USING INVERTED PROPERTIES("parser" = "chinese");
+
+-- Wait for schema change to complete
+SHOW ALTER TABLE COLUMN WHERE TableName='articles';
+```
+
+### Building Indexes
+
+After adding an index, you need to build it for existing data:
+
+```sql
+-- Build specific index (non-cloud mode)
+BUILD INDEX idx_content_chinese ON articles;
+
+-- Build all indexes (cloud mode)
+BUILD INDEX ON articles;
+
+-- Check build progress
+SHOW BUILD INDEX WHERE TableName='articles';
+```
+
+### Important Notes
+
+1. **Analyzer Identity**: Two analyzers with the same tokenizer and token_filter configuration are considered identical. You cannot create multiple indexes with identical analyzer identities on the same column.
+
+2. **Index Selection Behavior**:
+   - When using `USING ANALYZER`, if the specified analyzer's index exists and is built, it will be used
+   - If the specified index is not built, the query falls back to the non-index path (correct results, slower performance)
+   - Without `USING ANALYZER`, any available index may be used
+
+3. **Built-in Analyzers**: You can also use built-in analyzers directly:
+ ```sql
+ -- Using built-in analyzers
+ SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+ SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+ SELECT * FROM articles WHERE content MATCH '你好' USING ANALYZER chinese;
+ ```
+
+4. **Performance Considerations**:
+ - Each additional index increases storage space and write overhead
+ - Choose analyzers based on actual query patterns
+ - Consider using fewer indexes if query patterns are predictable
\ No newline at end of file
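The analyzer-identity rule in note 1 can be pictured as comparing normalized configurations. A hypothetical sketch of that check, assuming identity is the (tokenizer, token_filter) pair; the real comparison in Doris may consider more properties:

```python
# Hypothetical sketch: two analyzers are "identical" when their normalized
# (tokenizer, token_filter) configuration matches.
def analyzer_identity(props):
    return (props.get("tokenizer"), props.get("token_filter"))

std = {"tokenizer": "standard", "token_filter": "lowercase"}
std_copy = {"tokenizer": "standard", "token_filter": "lowercase"}
kw = {"tokenizer": "keyword", "token_filter": "lowercase"}

# Creating indexes with both std and std_copy on one column would be rejected:
print(analyzer_identity(std) == analyzer_identity(std_copy))  # True
print(analyzer_identity(std) == analyzer_identity(kw))        # False
```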
diff --git a/docs/ai/text-search/search-operators.md b/docs/ai/text-search/search-operators.md
index fe9ac3e5d00..238df7c484f 100644
--- a/docs/ai/text-search/search-operators.md
+++ b/docs/ai/text-search/search-operators.md
@@ -63,6 +63,45 @@ SELECT * FROM table_name WHERE content MATCH_REGEXP '^key_word.*';
SELECT * FROM table_name WHERE content MATCH_PHRASE_EDGE 'search engine optim';
```
+### Specifying Analyzer with USING ANALYZER
+
+When a column has multiple inverted indexes with different analyzers, use the `USING ANALYZER` clause to specify which analyzer to use for the query.
+
+**Syntax:**
+```sql
+SELECT * FROM table_name WHERE column MATCH 'keywords' USING ANALYZER analyzer_name;
+```
+
+**Supported Operators:**
+All MATCH operators support the `USING ANALYZER` clause:
+- MATCH / MATCH_ANY
+- MATCH_ALL
+- MATCH_PHRASE
+- MATCH_PHRASE_PREFIX
+- MATCH_PHRASE_EDGE
+- MATCH_REGEXP
+
+**Examples:**
+```sql
+-- Use standard analyzer (tokenizes text into words)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER std_analyzer;
+
+-- Use keyword analyzer (exact match, no tokenization)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER kw_analyzer;
+
+-- Use with MATCH_PHRASE
+SELECT * FROM articles WHERE content MATCH_PHRASE 'hello world' USING ANALYZER std_analyzer;
+
+-- Use built-in analyzers
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+```
+
+**Notes:**
+- If the specified analyzer's index is not built, the query automatically falls back to the non-index path (correct results, slower performance)
+- If no analyzer is specified, the system uses any available index
+- Built-in analyzer names: `none` (exact match), `standard`, `chinese`
+
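The fallback rules in the notes amount to a small selection procedure. A hypothetical sketch of that logic (the function and flags are illustrative, not Doris API):

```python
# Hypothetical sketch of index selection for `MATCH ... USING ANALYZER name`.
def pick_index(indexes, requested=None):
    """indexes: dict mapping analyzer name -> whether its index is built."""
    if requested is not None:
        # Requested index used only if built; otherwise fall back to
        # the non-index path (None), which is correct but slower.
        return requested if indexes.get(requested) else None
    # No analyzer specified: any built index may be used.
    return next((name for name, built in indexes.items() if built), None)

idx = {"std_analyzer": True, "kw_analyzer": False}
print(pick_index(idx, "std_analyzer"))  # std_analyzer
print(pick_index(idx, "kw_analyzer"))   # None -> non-index path
print(pick_index(idx))                  # std_analyzer
```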
## Inverted Index Query Acceleration
### Supported Operators and Functions
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
index 4ee82a8c309..a693648cdf9 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
@@ -288,3 +288,126 @@ select tokenize('hÉllo World', '"analyzer"="keyword_lowercase"');
{"token":"hello world"}
]
```
+
+## Multiple Analyzers on a Single Column
+
+Doris supports creating multiple inverted indexes with different analyzers on the same column, so the same data can be searched with different tokenization strategies.
+
+### Use Cases
+
+- **Multi-language support**: Use analyzers for different languages on the same text column
+- **Search precision vs. recall**: Use a keyword analyzer for exact match and a standard analyzer for fuzzy search
+- **Autocomplete**: Use an edge_ngram analyzer for prefix matching while keeping a standard analyzer for regular search
+
+### Creating Multiple Indexes
+
+```sql
+-- Create analyzers with different tokenization strategies
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS std_analyzer
+PROPERTIES ("tokenizer" = "standard", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS kw_analyzer
+PROPERTIES ("tokenizer" = "keyword", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_tokenizer
+PROPERTIES (
+ "type" = "edge_ngram",
+ "min_gram" = "1",
+ "max_gram" = "20",
+ "token_chars" = "letter"
+);
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS ngram_analyzer
+PROPERTIES ("tokenizer" = "edge_ngram_tokenizer", "token_filter" = "lowercase");
+
+-- Create multiple indexes on the same column
+CREATE TABLE articles (
+ id INT,
+ content TEXT,
+    -- Standard analyzer for tokenized search
+ INDEX idx_content_std (content) USING INVERTED
+ PROPERTIES("analyzer" = "std_analyzer", "support_phrase" = "true"),
+    -- Keyword analyzer for exact match
+ INDEX idx_content_kw (content) USING INVERTED
+ PROPERTIES("analyzer" = "kw_analyzer"),
+    -- Edge n-gram analyzer for autocomplete
+ INDEX idx_content_ngram (content) USING INVERTED
+ PROPERTIES("analyzer" = "ngram_analyzer")
+) ENGINE=OLAP
+DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_allocation" = "tag.location.default: 1");
+```
+
+### Querying with a Specific Analyzer
+
+Use the `USING ANALYZER` clause to specify which index to use:
+
+```sql
+-- Insert test data
+INSERT INTO articles VALUES
+ (1, 'hello world'),
+ (2, 'hello'),
+ (3, 'world'),
+ (4, 'hello world test');
+
+-- Tokenized search: matches rows containing the 'hello' token
+-- Returns: 1, 2, 4
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer ORDER BY id;
+
+-- Exact match: only matches rows with the exact string 'hello'
+-- Returns: 2
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer ORDER BY id;
+
+-- Prefix match with edge n-gram
+-- Returns: 1, 2, 4 (all rows starting with 'hel')
+SELECT id FROM articles WHERE content MATCH 'hel' USING ANALYZER ngram_analyzer ORDER BY id;
+```
+
+### Adding Indexes to Existing Tables
+
+```sql
+-- Add a new index with a different analyzer
+ALTER TABLE articles ADD INDEX idx_content_chinese (content)
+USING INVERTED PROPERTIES("parser" = "chinese");
+
+-- Wait for the schema change to complete
+SHOW ALTER TABLE COLUMN WHERE TableName='articles';
+```
+
+### Building Indexes
+
+After adding an index, build it for existing data:
+
+```sql
+-- Build a specific index (non-cloud mode)
+BUILD INDEX idx_content_chinese ON articles;
+
+-- Build all indexes (cloud mode)
+BUILD INDEX ON articles;
+
+-- Check build progress
+SHOW BUILD INDEX WHERE TableName='articles';
+```
+
+### Important Notes
+
+1. **Analyzer identity**: Two analyzers with the same tokenizer and token_filter configuration are considered identical. You cannot create multiple indexes with identical analyzer identities on the same column.
+
+2. **Index selection behavior**:
+   - With `USING ANALYZER`, if the specified analyzer's index exists and is built, that index is used
+   - If the index is not built, the query falls back to the non-index path (correct results, slower performance)
+   - Without `USING ANALYZER`, any available index may be used
+
+3. **Built-in analyzers**: Built-in analyzers can also be used directly:
+   ```sql
+   -- Using built-in analyzers
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+   SELECT * FROM articles WHERE content MATCH '你好' USING ANALYZER chinese;
+   ```
+
+4. **Performance considerations**:
+   - Each additional index increases storage space and write overhead
+   - Choose analyzers based on actual query patterns
+   - Use fewer indexes when query patterns are predictable
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
index b730e14d7c9..84b9d314ec8 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
@@ -63,6 +63,45 @@ SELECT * FROM table_name WHERE content MATCH_REGEXP '^key_word.*';
SELECT * FROM table_name WHERE content MATCH_PHRASE_EDGE 'search engine optim';
```
+### Specifying an Analyzer with USING ANALYZER
+
+When a column has multiple inverted indexes with different analyzers, use the `USING ANALYZER` clause to specify which analyzer the query should use.
+
+**Syntax:**
+```sql
+SELECT * FROM table_name WHERE column MATCH 'keywords' USING ANALYZER analyzer_name;
+```
+
+**Supported Operators:**
+All MATCH operators support the `USING ANALYZER` clause:
+- MATCH / MATCH_ANY
+- MATCH_ALL
+- MATCH_PHRASE
+- MATCH_PHRASE_PREFIX
+- MATCH_PHRASE_EDGE
+- MATCH_REGEXP
+
+**Examples:**
+```sql
+-- Use the standard analyzer (tokenizes text into words)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER std_analyzer;
+
+-- Use the keyword analyzer (exact match, no tokenization)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER kw_analyzer;
+
+-- Use with MATCH_PHRASE
+SELECT * FROM articles WHERE content MATCH_PHRASE 'hello world' USING ANALYZER std_analyzer;
+
+-- Use built-in analyzers
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+```
+
+**Notes:**
+- If the specified analyzer's index is not built, the query automatically falls back to the non-index path (correct results, slower performance)
+- If no analyzer is specified, the system uses any available index
+- Built-in analyzer names: `none` (exact match), `standard` (standard tokenization), `chinese` (Chinese tokenization)
+
## Inverted Index Query Acceleration
### Supported Operators and Functions