This is an automated email from the ASF dual-hosted git repository.

airborne pushed a commit to branch feature/multi-analyzer-index-docs
in repository https://gitbox.apache.org/repos/asf/doris-website.git
commit 237b4880e98eac9db277d9e62b32869132cb74bb
Author: airborne12 <[email protected]>
AuthorDate: Sun Jan 11 17:06:12 2026 +0800

    [docs] Add multi-analyzer inverted index documentation

    Add documentation for the new feature that allows creating multiple
    inverted indexes with different analyzers on a single column.

    Changes:
    - Add USING ANALYZER syntax to search-operators.md (EN/CN)
    - Add "Multiple Analyzers on Single Column" section to custom-analyzer.md (EN/CN)

    The documentation covers:
    - Syntax and supported operators
    - Creating multiple indexes on same column
    - Querying with specific analyzer
    - Adding indexes to existing tables
    - Building indexes
    - Important notes on analyzer identity and index selection

    Co-Authored-By: Claude Opus 4.5 <[email protected]>
---
 docs/ai/text-search/custom-analyzer.md         | 125 ++++++++++++++++++++-
 docs/ai/text-search/search-operators.md        |  39 +++++++
 .../current/ai/text-search/custom-analyzer.md  | 123 ++++++++++++++++++++
 .../current/ai/text-search/search-operators.md |  39 +++++++
 4 files changed, 325 insertions(+), 1 deletion(-)

diff --git a/docs/ai/text-search/custom-analyzer.md b/docs/ai/text-search/custom-analyzer.md
index f6b050de611..1c183478240 100644
--- a/docs/ai/text-search/custom-analyzer.md
+++ b/docs/ai/text-search/custom-analyzer.md
@@ -285,4 +285,127 @@ SELECT * FROM stars WHERE name MATCH '刘德华';
 SELECT * FROM stars WHERE name MATCH 'liu';
 SELECT * FROM stars WHERE name MATCH 'ldh';
 SELECT * FROM stars WHERE name MATCH 'zxy';
-```
\ No newline at end of file
+```
+
+## Multiple Analyzers on Single Column
+
+Doris supports creating multiple inverted indexes with different analyzers on a single column. This enables flexible search strategies where the same data can be searched using different tokenization methods.
+
+### Use Cases
+
+- **Multi-language support**: Use different analyzers for different languages on the same text column
+- **Search precision vs. recall**: Use the keyword analyzer for exact matching and the standard analyzer for fuzzy search
+- **Autocomplete**: Use an edge_ngram analyzer for prefix matching while keeping the standard analyzer for regular search
+
+### Creating Multiple Indexes
+
+```sql
+-- Create analyzers with different tokenization strategies
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS std_analyzer
+PROPERTIES ("tokenizer" = "standard", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS kw_analyzer
+PROPERTIES ("tokenizer" = "keyword", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_tokenizer
+PROPERTIES (
+    "type" = "edge_ngram",
+    "min_gram" = "1",
+    "max_gram" = "20",
+    "token_chars" = "letter"
+);
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS ngram_analyzer
+PROPERTIES ("tokenizer" = "edge_ngram_tokenizer", "token_filter" = "lowercase");
+
+-- Create a table with multiple indexes on the same column
+CREATE TABLE articles (
+    id INT,
+    content TEXT,
+    -- Standard analyzer for tokenized search
+    INDEX idx_content_std (content) USING INVERTED
+        PROPERTIES("analyzer" = "std_analyzer", "support_phrase" = "true"),
+    -- Keyword analyzer for exact matching
+    INDEX idx_content_kw (content) USING INVERTED
+        PROPERTIES("analyzer" = "kw_analyzer"),
+    -- Edge n-gram analyzer for autocomplete
+    INDEX idx_content_ngram (content) USING INVERTED
+        PROPERTIES("analyzer" = "ngram_analyzer")
+) ENGINE=OLAP
+DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_allocation" = "tag.location.default: 1");
+```
+
+### Querying with a Specific Analyzer
+
+Use the `USING ANALYZER` clause to specify which index to use:
+
+```sql
+-- Insert test data
+INSERT INTO articles VALUES
+    (1, 'hello world'),
+    (2, 'hello'),
+    (3, 'world'),
+    (4, 'hello world test');
+
+-- Tokenized search: matches rows containing the 'hello' token
+-- Returns: 1, 2, 4
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer ORDER BY id;
+
+-- Exact match: only matches rows whose value is exactly 'hello'
+-- Returns: 2
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer ORDER BY id;
+
+-- Prefix match with edge n-gram
+-- Returns: 1, 2, 4 (all rows containing a word starting with 'hel')
+SELECT id FROM articles WHERE content MATCH 'hel' USING ANALYZER ngram_analyzer ORDER BY id;
+```
+
+### Adding Indexes to Existing Tables
+
+```sql
+-- Add a new index with a different analyzer
+ALTER TABLE articles ADD INDEX idx_content_chinese (content)
+USING INVERTED PROPERTIES("parser" = "chinese");
+
+-- Wait for the schema change to complete
+SHOW ALTER TABLE COLUMN WHERE TableName='articles';
+```
+
+### Building Indexes
+
+After adding an index, you need to build it for existing data:
+
+```sql
+-- Build a specific index (non-cloud mode)
+BUILD INDEX idx_content_chinese ON articles;
+
+-- Build all indexes (cloud mode)
+BUILD INDEX ON articles;
+
+-- Check build progress
+SHOW BUILD INDEX WHERE TableName='articles';
+```
+
+### Important Notes
+
+1. **Analyzer Identity**: Two analyzers with the same tokenizer and token_filter configuration are considered identical. You cannot create multiple indexes with identical analyzer identities on the same column.
+
+2. **Index Selection Behavior**:
+   - When using `USING ANALYZER`, if the specified analyzer's index exists and is built, it is used
+   - If the specified index is not built, the query falls back to the non-index path (correct results, slower performance)
+   - Without `USING ANALYZER`, any available index may be used
+
+3. **Built-in Analyzers**: You can also use built-in analyzers directly:
+   ```sql
+   -- Using built-in analyzers
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+   SELECT * FROM articles WHERE content MATCH '你好' USING ANALYZER chinese;
+   ```
+
+4. **Performance Considerations**:
+   - Each additional index increases storage space and write overhead
+   - Choose analyzers based on actual query patterns
+   - Consider using fewer indexes if query patterns are predictable
\ No newline at end of file
diff --git a/docs/ai/text-search/search-operators.md b/docs/ai/text-search/search-operators.md
index fe9ac3e5d00..238df7c484f 100644
--- a/docs/ai/text-search/search-operators.md
+++ b/docs/ai/text-search/search-operators.md
@@ -63,6 +63,45 @@ SELECT * FROM table_name WHERE content MATCH_REGEXP '^key_word.*';
 SELECT * FROM table_name WHERE content MATCH_PHRASE_EDGE 'search engine optim';
 ```
 
+### Specifying the Analyzer with USING ANALYZER
+
+When a column has multiple inverted indexes with different analyzers, use the `USING ANALYZER` clause to specify which analyzer to use for the query.
+
+**Syntax:**
+```sql
+SELECT * FROM table_name WHERE column MATCH 'keywords' USING ANALYZER analyzer_name;
+```
+
+**Supported Operators:**
+All MATCH operators support the `USING ANALYZER` clause:
+- MATCH / MATCH_ANY
+- MATCH_ALL
+- MATCH_PHRASE
+- MATCH_PHRASE_PREFIX
+- MATCH_PHRASE_EDGE
+- MATCH_REGEXP
+
+**Examples:**
+```sql
+-- Use the standard analyzer (tokenizes text into words)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER std_analyzer;
+
+-- Use the keyword analyzer (exact match, no tokenization)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER kw_analyzer;
+
+-- Use with MATCH_PHRASE
+SELECT * FROM articles WHERE content MATCH_PHRASE 'hello world' USING ANALYZER std_analyzer;
+
+-- Use built-in analyzers
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+```
+
+**Notes:**
+- If the specified analyzer's index is not built, the query automatically falls back to the non-index path (correct results, slower performance)
+- If no analyzer is specified, the system uses any available index
+- Built-in analyzer names: `none` (exact match), `standard`, `chinese`
+
 ## Inverted Index Query Acceleration
 
 ### Supported Operators and Functions
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
index 4ee82a8c309..a693648cdf9 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
@@ -288,3 +288,126 @@ select tokenize('hÉllo World', '"analyzer"="keyword_lowercase"');
 {"token":"hello world"}
 ]
 ```
+
+## Multiple Analyzers on a Single Column
+
+Doris supports creating multiple inverted indexes with different analyzers on the same column, so the same data can be searched with different tokenization strategies, providing flexible search capabilities.
+
+### Use Cases
+
+- **Multi-language support**: Use analyzers for different languages on the same text column
+- **Search precision vs. recall**: Use the keyword analyzer for exact matching and the standard analyzer for fuzzy search
+- **Autocomplete**: Use an edge_ngram analyzer for prefix matching while keeping the standard analyzer for regular search
+
+### Creating Multiple Indexes
+
+```sql
+-- Create analyzers with different tokenization strategies
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS std_analyzer
+PROPERTIES ("tokenizer" = "standard", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS kw_analyzer
+PROPERTIES ("tokenizer" = "keyword", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_tokenizer
+PROPERTIES (
+    "type" = "edge_ngram",
+    "min_gram" = "1",
+    "max_gram" = "20",
+    "token_chars" = "letter"
+);
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS ngram_analyzer
+PROPERTIES ("tokenizer" = "edge_ngram_tokenizer", "token_filter" = "lowercase");
+
+-- Create multiple indexes on the same column
+CREATE TABLE articles (
+    id INT,
+    content TEXT,
+    -- Standard analyzer for tokenized search
+    INDEX idx_content_std (content) USING INVERTED
+        PROPERTIES("analyzer" = "std_analyzer", "support_phrase" = "true"),
+    -- Keyword analyzer for exact matching
+    INDEX idx_content_kw (content) USING INVERTED
+        PROPERTIES("analyzer" = "kw_analyzer"),
+    -- Edge n-gram analyzer for autocomplete
+    INDEX idx_content_ngram (content) USING INVERTED
+        PROPERTIES("analyzer" = "ngram_analyzer")
+) ENGINE=OLAP
+DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_allocation" = "tag.location.default: 1");
+```
+
+### Querying with a Specific Analyzer
+
+Use the `USING ANALYZER` clause to specify which index to use:
+
+```sql
+-- Insert test data
+INSERT INTO articles VALUES
+    (1, 'hello world'),
+    (2, 'hello'),
+    (3, 'world'),
+    (4, 'hello world test');
+
+-- Tokenized search: matches rows containing the 'hello' token
+-- Returns: 1, 2, 4
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer ORDER BY id;
+
+-- Exact match: only matches rows whose value is exactly 'hello'
+-- Returns: 2
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer ORDER BY id;
+
+-- Prefix match with edge n-gram
+-- Returns: 1, 2, 4 (all rows containing a word starting with 'hel')
+SELECT id FROM articles WHERE content MATCH 'hel' USING ANALYZER ngram_analyzer ORDER BY id;
+```
+
+### Adding Indexes to Existing Tables
+
+```sql
+-- Add a new index with a different analyzer
+ALTER TABLE articles ADD INDEX idx_content_chinese (content)
+USING INVERTED PROPERTIES("parser" = "chinese");
+
+-- Wait for the schema change to complete
+SHOW ALTER TABLE COLUMN WHERE TableName='articles';
+```
+
+### Building Indexes
+
+After adding an index, you need to build it for existing data:
+
+```sql
+-- Build a specific index (non-cloud mode)
+BUILD INDEX idx_content_chinese ON articles;
+
+-- Build all indexes (cloud mode)
+BUILD INDEX ON articles;
+
+-- Check build progress
+SHOW BUILD INDEX WHERE TableName='articles';
+```
+
+### Important Notes
+
+1. **Analyzer Identity**: Two analyzers with the same tokenizer and token_filter configuration are considered identical. You cannot create multiple indexes with the same analyzer identity on one column.
+
+2. **Index Selection Behavior**:
+   - When using `USING ANALYZER`, if the specified analyzer's index exists and is built, that index is used
+   - If the index is not built, the query falls back to the non-index path (correct results, slower performance)
+   - Without `USING ANALYZER`, any available index may be used
+
+3. **Built-in Analyzers**: You can also use built-in analyzers directly:
+   ```sql
+   -- Using built-in analyzers
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+   SELECT * FROM articles WHERE content MATCH '你好' USING ANALYZER chinese;
+   ```
+
+4. **Performance Considerations**:
+   - Each additional index increases storage space and write overhead
+   - Choose analyzers based on actual query patterns
+   - Consider using fewer indexes if query patterns are predictable
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
index b730e14d7c9..84b9d314ec8 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
@@ -63,6 +63,45 @@ SELECT * FROM table_name WHERE content MATCH_REGEXP '^key_word.*';
 SELECT * FROM table_name WHERE content MATCH_PHRASE_EDGE 'search engine optim';
 ```
 
+### Specifying the Analyzer with USING ANALYZER
+
+When a column has multiple inverted indexes with different analyzers, use the `USING ANALYZER` clause to specify which analyzer to use for the query.
+
+**Syntax:**
+```sql
+SELECT * FROM table_name WHERE column MATCH 'keywords' USING ANALYZER analyzer_name;
+```
+
+**Supported Operators:**
+All MATCH operators support the `USING ANALYZER` clause:
+- MATCH / MATCH_ANY
+- MATCH_ALL
+- MATCH_PHRASE
+- MATCH_PHRASE_PREFIX
+- MATCH_PHRASE_EDGE
+- MATCH_REGEXP
+
+**Examples:**
+```sql
+-- Use the standard analyzer (tokenizes text into words)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER std_analyzer;
+
+-- Use the keyword analyzer (exact match, no tokenization)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER kw_analyzer;
+
+-- Use with MATCH_PHRASE
+SELECT * FROM articles WHERE content MATCH_PHRASE 'hello world' USING ANALYZER std_analyzer;
+
+-- Use built-in analyzers
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+```
+
+**Notes:**
+- If the specified analyzer's index is not built, the query automatically falls back to the non-index path (correct results, slower performance)
+- If no analyzer is specified, the system uses any available index
+- Built-in analyzer names: `none` (exact match), `standard` (standard tokenization), `chinese` (Chinese tokenization)
+
 ## Inverted Index Query Acceleration
 
 ### Supported Operators and Functions
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
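As a footnote to the docs above: the prefix-match behavior of the `edge_ngram` analyzer (why `MATCH 'hel'` returns every row containing a word that starts with "hel") can be modeled with a few lines of Python. This is an illustrative sketch only, not Doris's implementation; the regex-based letter split stands in for `"token_chars" = "letter"`, and the `.lower()` call stands in for the `lowercase` token filter.

```python
import re

def edge_ngram_tokenize(text, min_gram=1, max_gram=20):
    """Illustrative model of an edge_ngram analyzer with a lowercase
    token filter: split on non-letter characters, then emit each
    word's prefixes from min_gram to max_gram characters long."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n].lower())
    return tokens

# 'hel' is among the indexed tokens for 'hello world', which is why a
# MATCH 'hel' query using the ngram_analyzer index hits that row.
print(edge_ngram_tokenize("hello"))
```

Because every prefix of every word is indexed, query terms need no further tokenization at search time; this is also why a larger `max_gram` inflates index size, matching the storage caveat in the Performance Considerations notes.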
