This is an automated email from the ASF dual-hosted git repository. airborne pushed a commit to branch feature/search-lucene-mode-docs in repository https://gitbox.apache.org/repos/asf/doris-website.git
commit d1ca8dec2993a43cf84ef187e36bfcd3dd614108 Author: airborne12 <[email protected]> AuthorDate: Sun Jan 11 17:44:53 2026 +0800 docs: Add Lucene boolean mode and escape characters documentation for search function Add documentation for two new features in the SEARCH function: 1. Lucene Boolean Mode: - JSON-based options parameter (mode, minimum_should_match) - Left-to-right modifier parsing (MUST/SHOULD/MUST_NOT) - Behavior comparison table with standard mode 2. Escape Characters: - Support for escaping special characters in DSL - Backslash escapes for space, parentheses, colon, backslash Updated both English and Chinese versions of search-function.md. Related PR: apache/doris#59394 Co-Authored-By: Claude Opus 4.5 <[email protected]> --- docs/ai/text-search/search-function.md | 81 ++++++++++++++++++++++ .../current/ai/text-search/search-function.md | 81 ++++++++++++++++++++++ 2 files changed, 162 insertions(+) diff --git a/docs/ai/text-search/search-function.md b/docs/ai/text-search/search-function.md index 51fc680fcf5..468c96fd6c7 100644 --- a/docs/ai/text-search/search-function.md +++ b/docs/ai/text-search/search-function.md @@ -35,6 +35,29 @@ Usage When `default_field` is provided, Doris expands bare terms or functions to that field. For example, `SEARCH('foo bar', 'tags', 'and')` behaves like `SEARCH('tags:ALL(foo bar)')`, while `SEARCH('foo bark', 'tags')` expands to `tags:ANY(foo bark)`. Explicit boolean operators inside the DSL always take precedence over the default operator. +### Options Parameter (JSON format) + +The second parameter can also be a JSON string for advanced configuration: + +```sql +SEARCH('<search_expression>', '<options_json>') +``` + +**Supported options:** + +| Option | Type | Description | +|--------|------|-------------| +| `default_field` | string | Column name for terms without explicit field | +| `default_operator` | string | `and` or `or` for multi-term expressions | +| `mode` | string | `standard` (default) or `lucene` | +| `minimum_should_match` | integer | Minimum SHOULD clauses to match (Lucene mode only) | + +**Example:** +```sql +SELECT * FROM docs WHERE search('apple banana', + '{"default_field":"title","default_operator":"and","mode":"lucene"}'); +``` + `SEARCH()` follows SQL three-valued logic. Rows where all referenced fields are NULL evaluate to UNKNOWN (filtered out in the `WHERE` clause) unless other predicates short-circuit the expression (`TRUE OR NULL = TRUE`, `FALSE OR NULL = NULL`, `NOT NULL = NULL`), matching the behavior of dedicated text search operators. ### Current Supported Queries @@ -100,6 +123,39 @@ SELECT id, title FROM search_test_basic WHERE SEARCH('tags:ANY(python javascript) AND (category:Technology OR category:Programming)'); ``` +#### Lucene Boolean Mode + +Lucene mode mimics Elasticsearch/Lucene query_string behavior where boolean operators work as left-to-right modifiers instead of traditional boolean algebra. + +**Key differences from standard mode:** +- AND/OR/NOT are modifiers that affect adjacent terms +- Operator precedence is left-to-right +- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum) +- Pure NOT queries return empty results (need a positive clause) + +**Enable Lucene mode:** +```sql +-- Basic Lucene mode +SELECT * FROM docs WHERE search('apple AND banana', + '{"default_field":"title","mode":"lucene"}'); + +-- With minimum_should_match +SELECT * FROM docs WHERE search('apple AND banana OR cherry', + '{"default_field":"title","mode":"lucene","minimum_should_match":1}'); +``` + +**Behavior comparison:** + +| Query | Standard Mode | Lucene Mode | +|-------|--------------|-------------| +| `a AND b` | a ∩ b | +a +b (both MUST) | +| `a OR b` | a ∪ b | a b (both SHOULD, min=1) | +| `NOT a` | ¬a | Empty (no positive clause) | +| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) | +| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) | + +**Note:** In Lucene mode, `a AND b OR c` parses left-to-right: the OR operator changes `b` from MUST to SHOULD. Use `minimum_should_match` to require SHOULD matches. + #### Phrase query - Syntax: `column:"quoted phrase"` - Semantics: matches contiguous tokens in order using the column's analyzer; quotes must wrap the entire phrase. @@ -253,6 +309,31 @@ WHERE SEARCH('properties.message:hello OR properties.category:beta') ORDER BY id; ``` +#### Escape Characters + +Use backslash (`\`) to escape special characters in DSL: + +| Escape | Description | Example | +|--------|-------------|---------| +| `\ ` | Literal space (joins terms) | `title:First\ Value` matches "First Value" | +| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" | +| `\:` | Literal colon | `title:key\:value` matches "key:value" | +| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" | + +**Example:** +```sql +-- Search for value containing space as single term +SELECT * FROM docs WHERE search('title:First\\ Value'); + +-- Search for value with parentheses +SELECT * FROM docs WHERE search('title:hello\\(world\\)'); + +-- Search for value with colon +SELECT * FROM docs WHERE search('title:key\\:value'); +``` + +**Note:** In SQL strings, backslashes need double escaping. Use `\\` in SQL to produce a single `\` in the DSL. + ### Current Limitations - Range and list clauses (`field:[a TO b]`, `field:IN(...)`) still degrade to term lookups; rely on regular SQL predicates for numeric/date ranges or explicit `IN` filters. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-function.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-function.md index 11a45ea7256..622163252b5 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-function.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/text-search/search-function.md @@ -35,6 +35,29 @@ SEARCH('<search_expression>', '<default_field>', '<default_operator>') 提供 `default_field` 后,Doris 会把裸词项或函数自动扩展到该字段。例如 `SEARCH('foo bar', 'tags', 'and')` 等价于 `SEARCH('tags:ALL(foo bar)')`,而 `SEARCH('foo bark', 'tags')` 会展开为 `tags:ANY(foo bark)`。DSL 中显式出现的布尔操作优先级最高,会覆盖默认运算符。 +### Options 参数(JSON 格式) + +第二个参数也可以是 JSON 字符串,用于高级配置: + +```sql +SEARCH('<search_expression>', '<options_json>') +``` + +**支持的选项:** + +| 选项 | 类型 | 说明 | +|------|------|------| +| `default_field` | string | 未指定字段的词项使用的默认列名 | +| `default_operator` | string | 多词项表达式的默认运算符(`and` 或 `or`) | +| `mode` | string | `standard`(默认)或 `lucene` | +| `minimum_should_match` | integer | SHOULD 子句最小匹配数(仅 Lucene 模式) | + +**示例:** +```sql +SELECT * FROM docs WHERE search('apple banana', + '{"default_field":"title","default_operator":"and","mode":"lucene"}'); +``` + `SEARCH()` 遵循 SQL 三值逻辑。当所有参与匹配的列值均为 NULL 时结果为 UNKNOWN(在 `WHERE` 中被过滤),但若与其他子表达式组合,可按布尔短路原则返回 TRUE 或继续保留 NULL(例如 `TRUE OR NULL = TRUE`、`FALSE OR NULL = NULL`、`NOT NULL = NULL`),行为与文本检索算子保持一致。 ### 当前支持语法 @@ -100,6 +123,39 @@ SELECT id, title FROM search_test_basic WHERE SEARCH('tags:ANY(python javascript) AND (category:Technology OR category:Programming)'); ``` +#### Lucene 布尔模式 + +Lucene 模式模拟 Elasticsearch/Lucene 的 query_string 行为,布尔操作符作为左到右的修饰符工作,而非传统的布尔代数。 + +**与标准模式的主要区别:** +- AND/OR/NOT 是影响相邻词项的修饰符 +- 操作符优先级从左到右 +- 内部使用 MUST/SHOULD/MUST_NOT(类似 Lucene 的 Occur 枚举) +- 纯 NOT 查询返回空结果(需要正向子句) + +**启用 Lucene 模式:** +```sql +-- 基本 Lucene 模式 +SELECT * FROM docs WHERE search('apple AND banana', + '{"default_field":"title","mode":"lucene"}'); + +-- 使用 minimum_should_match +SELECT * FROM docs WHERE search('apple AND banana OR cherry', + '{"default_field":"title","mode":"lucene","minimum_should_match":1}'); +``` + +**行为对比:** + +| 查询 | 标准模式 | Lucene 模式 | +|------|----------|-------------| +| `a AND b` | a ∩ b | +a +b(都是 MUST) | +| `a OR b` | a ∪ b | a b(都是 SHOULD,min=1) | +| `NOT a` | ¬a | 空结果(无正向子句) | +| `a AND NOT b` | a ∩ ¬b | +a -b(MUST a,MUST_NOT b) | +| `a AND b OR c` | (a ∩ b) ∪ c | +a b c(只有 a 是 MUST) | + +**注意:** 在 Lucene 模式中,`a AND b OR c` 从左到右解析:OR 操作符将 `b` 从 MUST 改为 SHOULD。使用 `minimum_should_match` 来要求 SHOULD 子句匹配。 + #### 词组查询 - 语法:`column:"quoted phrase"` - 语义:根据列的分析器匹配连续且有序的词项,需使用双引号包裹完整短语。 @@ -253,6 +309,31 @@ WHERE SEARCH('properties.message:hello OR properties.category:beta') ORDER BY id; ``` +#### 转义字符 + +使用反斜杠(`\`)转义 DSL 中的特殊字符: + +| 转义 | 说明 | 示例 | +|------|------|------| +| `\ ` | 字面空格(连接词项) | `title:First\ Value` 匹配 "First Value" | +| `\(` `\)` | 字面括号 | `title:hello\(world\)` 匹配 "hello(world)" | +| `\:` | 字面冒号 | `title:key\:value` 匹配 "key:value" | +| `\\` | 字面反斜杠 | `title:path\\to\\file` 匹配 "path\to\file" | + +**示例:** +```sql +-- 搜索包含空格的值作为单个词项 +SELECT * FROM docs WHERE search('title:First\\ Value'); + +-- 搜索包含括号的值 +SELECT * FROM docs WHERE search('title:hello\\(world\\)'); + +-- 搜索包含冒号的值 +SELECT * FROM docs WHERE search('title:key\\:value'); +``` + +**注意:** 在 SQL 字符串中,反斜杠需要双重转义。使用 `\\` 在 SQL 中产生 DSL 中的单个 `\`。 + ### 当前限制 - 范围与列表子句(如 `field:[a TO b]`、`field:IN(...)`)仍会降级为普通词项匹配,建议使用常规 SQL 范围/`IN` 过滤。 --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
