This is an automated email from the ASF dual-hosted git repository.

airborne pushed a commit to branch feature/multi-analyzer-index-docs
in repository https://gitbox.apache.org/repos/asf/doris-website.git
commit 237b4880e98eac9db277d9e62b32869132cb74bb
Author: airborne12 <[email protected]>
AuthorDate: Sun Jan 11 17:06:12 2026 +0800

    [docs] Add multi-analyzer inverted index documentation

    Add documentation for the new feature that allows creating multiple
    inverted indexes with different analyzers on a single column.

    Changes:
    - Add USING ANALYZER syntax to search-operators.md (EN/CN)
    - Add "Multiple Analyzers on Single Column" section to custom-analyzer.md (EN/CN)

    The documentation covers:
    - Syntax and supported operators
    - Creating multiple indexes on same column
    - Querying with specific analyzer
    - Adding indexes to existing tables
    - Building indexes
    - Important notes on analyzer identity and index selection

    Co-Authored-By: Claude Opus 4.5 <[email protected]>
---
 docs/ai/text-search/custom-analyzer.md         | 125 ++++++++++++++++++++-
 docs/ai/text-search/search-operators.md        |  39 +++++++
 .../current/ai/text-search/custom-analyzer.md  | 123 ++++++++++++++++++++
 .../current/ai/text-search/search-operators.md |  39 +++++++
 4 files changed, 325 insertions(+), 1 deletion(-)

diff --git a/docs/ai/text-search/custom-analyzer.md b/docs/ai/text-search/custom-analyzer.md
index f6b050de611..1c183478240 100644
--- a/docs/ai/text-search/custom-analyzer.md
+++ b/docs/ai/text-search/custom-analyzer.md
@@ -285,4 +285,127 @@ SELECT * FROM stars WHERE name MATCH '刘德华';
 SELECT * FROM stars WHERE name MATCH 'liu';
 SELECT * FROM stars WHERE name MATCH 'ldh';
 SELECT * FROM stars WHERE name MATCH 'zxy';
-```
\ No newline at end of file
+```
+
+## Multiple Analyzers on Single Column
+
+Doris supports creating multiple inverted indexes with different analyzers on a single column. This enables flexible search strategies where the same data can be searched using different tokenization methods.
+
+### Use Cases
+
+- **Multi-language support**: Use different analyzers for different languages on the same text column
+- **Search precision vs. recall**: Use the keyword analyzer for exact matching and the standard analyzer for fuzzy search
+- **Autocomplete**: Use an edge_ngram analyzer for prefix matching while keeping the standard analyzer for regular search
+
+### Creating Multiple Indexes
+
+```sql
+-- Create analyzers with different tokenization strategies
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS std_analyzer
+PROPERTIES ("tokenizer" = "standard", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS kw_analyzer
+PROPERTIES ("tokenizer" = "keyword", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_tokenizer
+PROPERTIES (
+    "type" = "edge_ngram",
+    "min_gram" = "1",
+    "max_gram" = "20",
+    "token_chars" = "letter"
+);
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS ngram_analyzer
+PROPERTIES ("tokenizer" = "edge_ngram_tokenizer", "token_filter" = "lowercase");
+
+-- Create a table with multiple indexes on the same column
+CREATE TABLE articles (
+    id INT,
+    content TEXT,
+    -- Standard analyzer for tokenized search
+    INDEX idx_content_std (content) USING INVERTED
+        PROPERTIES("analyzer" = "std_analyzer", "support_phrase" = "true"),
+    -- Keyword analyzer for exact matching
+    INDEX idx_content_kw (content) USING INVERTED
+        PROPERTIES("analyzer" = "kw_analyzer"),
+    -- Edge n-gram analyzer for autocomplete
+    INDEX idx_content_ngram (content) USING INVERTED
+        PROPERTIES("analyzer" = "ngram_analyzer")
+) ENGINE=OLAP
+DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_allocation" = "tag.location.default: 1");
+```
+
+### Querying with a Specific Analyzer
+
+Use the `USING ANALYZER` clause to specify which index to use:
+
+```sql
+-- Insert test data
+INSERT INTO articles VALUES
+    (1, 'hello world'),
+    (2, 'hello'),
+    (3, 'world'),
+    (4, 'hello world test');
+
+-- Tokenized search: matches rows containing the 'hello' token
+-- Returns: 1, 2, 4
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer ORDER BY id;
+
+-- Exact match: only matches rows whose value is exactly 'hello'
+-- Returns: 2
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer ORDER BY id;
+
+-- Prefix match with edge n-gram
+-- Returns: 1, 2, 4 (all rows containing a word starting with 'hel')
+SELECT id FROM articles WHERE content MATCH 'hel' USING ANALYZER ngram_analyzer ORDER BY id;
+```
+
+### Adding Indexes to Existing Tables
+
+```sql
+-- Add a new index with a different analyzer
+ALTER TABLE articles ADD INDEX idx_content_chinese (content)
+USING INVERTED PROPERTIES("parser" = "chinese");
+
+-- Wait for the schema change to complete
+SHOW ALTER TABLE COLUMN WHERE TableName='articles';
+```
+
+### Building Indexes
+
+After adding an index, you need to build it for existing data:
+
+```sql
+-- Build a specific index (non-cloud mode)
+BUILD INDEX idx_content_chinese ON articles;
+
+-- Build all indexes (cloud mode)
+BUILD INDEX ON articles;
+
+-- Check build progress
+SHOW BUILD INDEX WHERE TableName='articles';
+```
+
+### Important Notes
+
+1. **Analyzer Identity**: Two analyzers with the same tokenizer and token_filter configuration are considered identical. You cannot create multiple indexes with identical analyzer identities on the same column.
+
+2. **Index Selection Behavior**:
+   - When using `USING ANALYZER`, if the specified analyzer's index exists and is built, it is used
+   - If the specified index is not built, the query falls back to the non-index path (correct results, slower performance)
+   - Without `USING ANALYZER`, any available index may be used
+
+3. **Built-in Analyzers**: You can also use built-in analyzers directly:
+   ```sql
+   -- Using built-in analyzers
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+   SELECT * FROM articles WHERE content MATCH '你好' USING ANALYZER chinese;
+   ```
+
+4. **Performance Considerations**:
+   - Each additional index increases storage space and write overhead
+   - Choose analyzers based on actual query patterns
+   - Consider using fewer indexes if query patterns are predictable
\ No newline at end of file
diff --git a/docs/ai/text-search/search-operators.md b/docs/ai/text-search/search-operators.md
index fe9ac3e5d00..238df7c484f 100644
--- a/docs/ai/text-search/search-operators.md
+++ b/docs/ai/text-search/search-operators.md
@@ -63,6 +63,45 @@ SELECT * FROM table_name WHERE content MATCH_REGEXP '^key_word.*';
 SELECT * FROM table_name WHERE content MATCH_PHRASE_EDGE 'search engine optim';
 ```
 
+### Specifying the Analyzer with USING ANALYZER
+
+When a column has multiple inverted indexes with different analyzers, use the `USING ANALYZER` clause to specify which analyzer to use for the query.
+
+**Syntax:**
+```sql
+SELECT * FROM table_name WHERE column MATCH 'keywords' USING ANALYZER analyzer_name;
+```
+
+**Supported Operators:**
+All MATCH operators support the `USING ANALYZER` clause:
+- MATCH / MATCH_ANY
+- MATCH_ALL
+- MATCH_PHRASE
+- MATCH_PHRASE_PREFIX
+- MATCH_PHRASE_EDGE
+- MATCH_REGEXP
+
+**Examples:**
+```sql
+-- Use the standard analyzer (tokenizes text into words)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER std_analyzer;
+
+-- Use the keyword analyzer (exact match, no tokenization)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER kw_analyzer;
+
+-- Use with MATCH_PHRASE
+SELECT * FROM articles WHERE content MATCH_PHRASE 'hello world' USING ANALYZER std_analyzer;
+
+-- Use built-in analyzers
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+```
+
+**Notes:**
+- If the specified analyzer's index is not built, the query automatically falls back to the non-index path (correct results, slower performance)
+- If no analyzer is specified, the system uses any available index
+- Built-in analyzer names: `none` (exact match), `standard`, `chinese`
+
 ## Inverted Index Query Acceleration
 
 ### Supported Operators and Functions
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
index 4ee82a8c309..a693648cdf9 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/custom-analyzer.md
@@ -288,3 +288,126 @@ select tokenize('hÉllo World', '"analyzer"="keyword_lowercase"');
 {"token":"hello world"}
 ]
 ```
+
+## Multiple Analyzers on a Single Column
+
+Doris supports creating multiple inverted indexes with different analyzers on the same column, so the same data can be searched with different tokenization strategies, providing flexible search capabilities.
+
+### Use Cases
+
+- **Multi-language support**: Use analyzers for different languages on the same text column
+- **Search precision vs. recall**: Use the keyword analyzer for exact matching and the standard analyzer for fuzzy search
+- **Autocomplete**: Use an edge_ngram analyzer for prefix matching while keeping the standard analyzer for regular search
+
+### Creating Multiple Indexes
+
+```sql
+-- Create analyzers with different tokenization strategies
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS std_analyzer
+PROPERTIES ("tokenizer" = "standard", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS kw_analyzer
+PROPERTIES ("tokenizer" = "keyword", "token_filter" = "lowercase");
+
+CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_tokenizer
+PROPERTIES (
+    "type" = "edge_ngram",
+    "min_gram" = "1",
+    "max_gram" = "20",
+    "token_chars" = "letter"
+);
+
+CREATE INVERTED INDEX ANALYZER IF NOT EXISTS ngram_analyzer
+PROPERTIES ("tokenizer" = "edge_ngram_tokenizer", "token_filter" = "lowercase");
+
+-- Create multiple indexes on the same column
+CREATE TABLE articles (
+    id INT,
+    content TEXT,
+    -- Standard analyzer for tokenized search
+    INDEX idx_content_std (content) USING INVERTED
+        PROPERTIES("analyzer" = "std_analyzer", "support_phrase" = "true"),
+    -- Keyword analyzer for exact matching
+    INDEX idx_content_kw (content) USING INVERTED
+        PROPERTIES("analyzer" = "kw_analyzer"),
+    -- Edge n-gram analyzer for autocomplete
+    INDEX idx_content_ngram (content) USING INVERTED
+        PROPERTIES("analyzer" = "ngram_analyzer")
+) ENGINE=OLAP
+DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES ("replication_allocation" = "tag.location.default: 1");
+```
+
+### Querying with a Specific Analyzer
+
+Use the `USING ANALYZER` clause to specify which index to use:
+
+```sql
+-- Insert test data
+INSERT INTO articles VALUES
+    (1, 'hello world'),
+    (2, 'hello'),
+    (3, 'world'),
+    (4, 'hello world test');
+
+-- Tokenized search: matches rows containing the 'hello' token
+-- Returns: 1, 2, 4
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer ORDER BY id;
+
+-- Exact match: only matches rows whose value is exactly 'hello'
+-- Returns: 2
+SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer ORDER BY id;
+
+-- Prefix match with edge n-gram
+-- Returns: 1, 2, 4 (all rows containing a word starting with 'hel')
+SELECT id FROM articles WHERE content MATCH 'hel' USING ANALYZER ngram_analyzer ORDER BY id;
+```
+
+### Adding Indexes to Existing Tables
+
+```sql
+-- Add a new index with a different analyzer
+ALTER TABLE articles ADD INDEX idx_content_chinese (content)
+USING INVERTED PROPERTIES("parser" = "chinese");
+
+-- Wait for the schema change to complete
+SHOW ALTER TABLE COLUMN WHERE TableName='articles';
+```
+
+### Building Indexes
+
+After adding an index, you need to build it for existing data:
+
+```sql
+-- Build a specific index (non-cloud mode)
+BUILD INDEX idx_content_chinese ON articles;
+
+-- Build all indexes (cloud mode)
+BUILD INDEX ON articles;
+
+-- Check build progress
+SHOW BUILD INDEX WHERE TableName='articles';
+```
+
+### Important Notes
+
+1. **Analyzer Identity**: Two analyzers with the same tokenizer and token_filter configuration are considered identical. You cannot create multiple indexes with the same analyzer identity on one column.
+
+2. **Index Selection Behavior**:
+   - When using `USING ANALYZER`, if the specified analyzer's index exists and is built, that index is used
+   - If the index is not built, the query falls back to the non-index path (correct results, slower performance)
+   - Without `USING ANALYZER`, any available index may be used
+
+3. **Built-in Analyzers**: You can also use built-in analyzers directly:
+   ```sql
+   -- Using built-in analyzers
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+   SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+   SELECT * FROM articles WHERE content MATCH '你好' USING ANALYZER chinese;
+   ```
+
+4. **Performance Considerations**:
+   - Each additional index increases storage space and write overhead
+   - Choose analyzers based on actual query patterns
+   - Consider using fewer indexes if query patterns are predictable
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
index b730e14d7c9..84b9d314ec8 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-operators.md
@@ -63,6 +63,45 @@ SELECT * FROM table_name WHERE content MATCH_REGEXP '^key_word.*';
 SELECT * FROM table_name WHERE content MATCH_PHRASE_EDGE 'search engine optim';
 ```
 
+### Specifying the Analyzer with USING ANALYZER
+
+When a column has multiple inverted indexes with different analyzers, use the `USING ANALYZER` clause to specify which analyzer to use for the query.
+
+**Syntax:**
+```sql
+SELECT * FROM table_name WHERE column MATCH 'keywords' USING ANALYZER analyzer_name;
+```
+
+**Supported Operators:**
+All MATCH operators support the `USING ANALYZER` clause:
+- MATCH / MATCH_ANY
+- MATCH_ALL
+- MATCH_PHRASE
+- MATCH_PHRASE_PREFIX
+- MATCH_PHRASE_EDGE
+- MATCH_REGEXP
+
+**Examples:**
+```sql
+-- Use the standard analyzer (tokenizes text into words)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER std_analyzer;
+
+-- Use the keyword analyzer (exact match, no tokenization)
+SELECT * FROM articles WHERE content MATCH 'hello world' USING ANALYZER kw_analyzer;
+
+-- Use with MATCH_PHRASE
+SELECT * FROM articles WHERE content MATCH_PHRASE 'hello world' USING ANALYZER std_analyzer;
+
+-- Use built-in analyzers
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
+SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
+```
+
+**Notes:**
+- If the specified analyzer's index is not built, the query automatically falls back to the non-index path (correct results, slower performance)
+- If no analyzer is specified, the system uses any available index
+- Built-in analyzer names: `none` (exact match), `standard` (standard tokenization), `chinese` (Chinese tokenization)
+
 ## Inverted Index Query Acceleration
 
 ### Supported Operators and Functions
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
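As a footnote to the docs above: the prefix-match behavior of the `edge_ngram` analyzer (why `MATCH 'hel'` returns every row containing a word that starts with "hel") can be modeled with a few lines of Python. This is an illustrative sketch only, not Doris's implementation; the regex-based letter split stands in for `"token_chars" = "letter"`, and the `.lower()` call stands in for the `lowercase` token filter.

```python
import re

def edge_ngram_tokenize(text, min_gram=1, max_gram=20):
    """Illustrative model of an edge_ngram analyzer with a lowercase
    token filter: split on non-letter characters, then emit each
    word's prefixes from min_gram to max_gram characters long."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n].lower())
    return tokens

# 'hel' is among the indexed tokens for 'hello world', which is why a
# MATCH 'hel' query using the ngram_analyzer index hits that row.
print(edge_ngram_tokenize("hello"))
```

Because every prefix of every word is indexed, query terms need no further tokenization at search time; this is also why a larger `max_gram` inflates index size, matching the storage caveat in the Performance Considerations notes.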
