This is an automated email from the ASF dual-hosted git repository.
dataroaring pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new f5f8f0e6ef2 docs(table-design): clarity pass on Index Overview
scenario prose (#3942)
f5f8f0e6ef2 is described below
commit f5f8f0e6ef266588a471806d0c674ade3d6dd779
Author: Yongqiang YANG <[email protected]>
AuthorDate: Thu Jun 18 05:36:33 2026 -0700
docs(table-design): clarity pass on Index Overview scenario prose (#3942)
## What
A focused readability pass on the index mechanism descriptions in
**Index Overview > "How Each Index Works"** (Scenarios 1-2): the prefix
/ inverted / ZoneMap / BloomFilter / NGram explanations.
- Split long run-on sentences into short declarative ones.
- Active voice with Doris as the subject (e.g. "Doris builds an inverted
table…" instead of passive constructions).
- Remove the marketing word "more powerful" ("replaced by the more
powerful inverted index" → "replaced by the inverted index").
No technical meaning changed. Reference tables (comparison, operators)
and the capability lists in Scenarios 3-4 are left as-is, since terse
reference content is appropriate there.
## Scope
- Touches only the scenario prose. It does **not** change the "Start
Here" section, which is in PR #3941; the two PRs edit different regions
of the same file.
- EN + 中文; applies to both `docs/` and `versioned_docs/version-4.x`.
## Why
This content reads as dense, translated-from-source prose (60+ word
single sentences, marketing adjective). The rest of the page was made
task-first in #3941; this brings the explanation prose to the same
clarity bar.
---
docs/table-design/index/index-overview.md | 12 ++++++------
.../current/table-design/index/index-overview.md | 12 ++++++------
.../version-4.x/table-design/index/index-overview.md | 12 ++++++------
.../version-4.x/table-design/index/index-overview.md | 12 ++++++------
4 files changed, 24 insertions(+), 24 deletions(-)
diff --git a/docs/table-design/index/index-overview.md
b/docs/table-design/index/index-overview.md
index 0a175abd0a2..205889dc1f2 100644
--- a/docs/table-design/index/index-overview.md
+++ b/docs/table-design/index/index-overview.md
@@ -42,11 +42,11 @@ Apache Doris provides four categories of indexes for
different query scenarios:
Apache Doris provides two point-query indexes:
-- **[Prefix index](./prefix-index.md)**: Apache Doris stores data in order
according to the sort key and creates a sparse prefix index every 1024 rows.
The Key in the index is the value of the sort columns of the first row in the
current 1024-row group. When a query involves the sorted columns, the system
finds the first row of the relevant 1024-row group and starts scanning from
there.
-- **[Inverted index](./inverted-index/overview.md)**: For columns with an
inverted index, Apache Doris builds an inverted table that maps each value to
the corresponding set of row numbers. For equality queries, it first looks up
the row number set from the inverted table and then directly reads the data of
those rows, avoiding a row-by-row scan and reducing I/O to accelerate the
query. Inverted indexes can also accelerate range filtering and text keyword
matching. The algorithms are mor [...]
+- **[Prefix index](./prefix-index.md)**: Doris stores data sorted by the sort
key and builds a sparse prefix index every 1024 rows. Each index entry holds
the sort-column values of the first row in its group. When a query filters on
the sort columns, Doris jumps to the right group and scans from there.
+- **[Inverted index](./inverted-index/overview.md)**: Doris builds an inverted
table that maps each value to the rows that contain it. For an equality query,
it looks up the matching rows and reads only those, which avoids a full scan
and cuts I/O. Inverted indexes also speed up range filters and text keyword
search; the algorithm is more involved, but the idea is the same.
:::note
-The previous BITMAP index has been replaced by the more powerful inverted
index.
+The BITMAP index has been replaced by the inverted index.
:::
### Scenario 2: Many Rows Match the Condition (Skip Indexes)
@@ -59,9 +59,9 @@ The previous BITMAP index has been replaced by the more
powerful inverted index.
Apache Doris provides three skip indexes:
-- **ZoneMap index**: Automatically maintains statistics for each column,
recording the maximum value, minimum value, and whether NULL exists for each
data file (Segment) and data block (Page). For equality queries, range queries,
and IS NULL, the maximum value, minimum value, and the presence of NULL can be
used to determine whether a data file or data block may contain rows that
satisfy the condition. If not, the corresponding file or block is skipped,
reducing I/O and accelerating the query.
-- **[BloomFilter index](./bloomfilter.md)**: Stores the possible values of the
indexed column in a BloomFilter data structure. A BloomFilter can quickly
determine whether a value exists, with very low storage overhead. For equality
queries, if the value is determined to be absent from the BloomFilter, the
corresponding data file or data block can be skipped, reducing I/O and
accelerating the query.
-- **[NGram BloomFilter index](./ngram-bloomfilter-index.md)**: Used to
accelerate text LIKE queries. The basic principle is similar to the BloomFilter
index. The difference is that what is stored in the BloomFilter is not the
original text value but each token produced by NGram tokenization of the text.
For LIKE queries, the LIKE pattern is also tokenized with NGram, and each token
is checked against the BloomFilter. If any token is absent, the corresponding
data file or data block does [...]
+- **ZoneMap index**: Doris automatically keeps per-column statistics (min,
max, and whether NULLs exist) for each data file (Segment) and data block
(Page). For equality, range, and IS NULL filters, it uses these stats to decide
whether a file or block can contain matching rows. If it can't, Doris skips
that file or block and cuts I/O.
+- **[BloomFilter index](./bloomfilter.md)**: Doris stores the column's values
in a BloomFilter, a structure that tells you whether a value is present with
very little storage. For an equality query, if the value isn't in the filter,
Doris skips that file or block and cuts I/O.
+- **[NGram BloomFilter index](./ngram-bloomfilter-index.md)**: Speeds up text
`LIKE` queries. It works like the BloomFilter index, but stores NGram tokens of
the text instead of whole values. For a `LIKE` query, Doris tokenizes the
pattern the same way and checks each token against the filter. If any token is
missing, the file or block can't match, so Doris skips it.
### Scenario 3: Full-Text Search on Text (Inverted Index)
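Note on the Scenario 1 text in the hunk above: the two point-query mechanisms it describes can be sketched in a few lines of Python. This is a toy model for illustration only, not Doris code. The function names, the in-memory lists, and the small `GROUP_SIZE` (the documentation says 1024 rows) are all assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

GROUP_SIZE = 4  # assumption for readability; the docs describe 1024-row groups


def build_prefix_index(sorted_keys, group_size=GROUP_SIZE):
    """Sparse index: keep the sort key of the first row of every group."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), group_size)]


def prefix_lookup(sorted_keys, index, target, group_size=GROUP_SIZE):
    """Jump to the group that can hold the first match, then scan forward."""
    # Start at the last group whose first key is strictly below the target,
    # because duplicates of the target may begin before a group boundary.
    group = max(bisect_left(index, target) - 1, 0)
    rows = []
    for row in range(group * group_size, len(sorted_keys)):
        key = sorted_keys[row]
        if key > target:  # data is sorted, so no later row can match
            break
        if key == target:
            rows.append(row)
    return rows


def build_inverted_index(values):
    """Map each distinct value to the list of row numbers that contain it."""
    postings = defaultdict(list)
    for row, value in enumerate(values):
        postings[value].append(row)
    return dict(postings)


def inverted_lookup(postings, target):
    """Equality query: read only the rows listed for the value."""
    return postings.get(target, [])


keys = [1, 1, 1, 3, 3, 3, 5, 5]  # already sorted by the (hypothetical) sort key
prefix_index = build_prefix_index(keys)
cities = ["bj", "sh", "bj", "sz"]
city_index = build_inverted_index(cities)
```

With these toy inputs, `prefix_lookup(keys, prefix_index, 5)` starts scanning at row 4 rather than row 0, and `inverted_lookup(city_index, "bj")` touches only rows 0 and 2.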
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/index-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/index-overview.md
index b5297253f4b..5ee231e1b42 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/index-overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/table-design/index/index-overview.md
@@ -42,11 +42,11 @@ Apache Doris 针对不同查询场景提供四类索引:**点查索引**、**
Apache Doris 提供两种点查索引:
-- **[前缀索引](./prefix-index.md)**:Apache Doris 按照排序键以有序方式存储数据,并每隔 1024
行创建一个稀疏前缀索引。索引中的 Key 是当前 1024 行中第一行的排序列值。当查询涉及已排序列时,系统会找到相关的 1024
行组的第一行并从那里开始扫描。
-- **[倒排索引](./inverted-index/overview.md)**:对建立倒排索引的列,构建每个值到对应行号集合的倒排表。等值查询时,先从倒排表中查到行号集合,然后直接读取对应行的数据,避免逐行扫描,从而减少 I/O 加速查询。倒排索引还能加速范围过滤、文本关键词匹配,算法更加复杂但基本原理类似。
+- **[前缀索引](./prefix-index.md)**:Doris 按排序键有序存储数据,并每隔 1024
行创建一个稀疏前缀索引。每个索引项记录所在组第一行的排序列值。当查询按排序列过滤时,Doris 定位到对应的行组并从那里开始扫描。
+- **[倒排索引](./inverted-index/overview.md)**:Doris
构建一张倒排表,把每个值映射到包含它的行。等值查询时,先查到匹配的行,再只读取这些行,避免全表扫描并减少
I/O。倒排索引也能加速范围过滤和文本关键词检索;算法更复杂,但思路相同。
:::note
-之前的 BITMAP 索引已被功能更强的倒排索引取代。
+BITMAP 索引已被倒排索引取代。
:::
### 场景二:满足条件的行较多(跳数索引)
@@ -59,9 +59,9 @@ Apache Doris 提供两种点查索引:
Apache Doris 提供三种跳数索引:
-- **ZoneMap 索引**:自动维护每一列的统计信息,为每一个数据文件(Segment)和数据块(Page)记录最大值、最小值、是否有
NULL。对于等值查询、范围查询、IS NULL,可以通过最大值、最小值、是否有 NULL
判断数据文件和数据块是否可能包含满足条件的数据,如果不包含则跳过对应文件或数据块,减少 I/O 加速查询。
-- **[BloomFilter 索引](./bloomfilter.md)**:将索引列的可能取值存入 BloomFilter
数据结构。BloomFilter 可以快速判断一个值是否存在,且存储空间占用很低。对于等值查询,如果判断该值不在 BloomFilter
中,就可以跳过对应的数据文件或数据块,减少 I/O 加速查询。
-- **[NGram BloomFilter 索引](./ngram-bloomfilter-index.md)**:用于加速文本 LIKE
查询。基本原理与 BloomFilter 索引类似,区别在于存入 BloomFilter 的不是原始文本值,而是对文本进行 NGram 分词后的每个词。对于
LIKE 查询,将 LIKE 的 pattern 也进行 NGram 分词,判断每个词是否在 BloomFilter
中,如果某个词不在则对应的数据文件或数据块就不满足 LIKE 条件,可以跳过这部分数据,减少 I/O 加速查询。
+- **ZoneMap 索引**:Doris 自动为每一列维护统计信息(最小值、最大值、是否有
NULL),按数据文件(Segment)和数据块(Page)记录。对于等值、范围、IS NULL 过滤,Doris
用这些统计判断某个文件或数据块是否可能包含匹配行。若不可能,则跳过该文件或数据块,减少 I/O。
+- **[BloomFilter 索引](./bloomfilter.md)**:Doris 将列的取值存入
BloomFilter,这种结构能以很小的存储判断某个值是否存在。等值查询时,若该值不在 BloomFilter 中,Doris 跳过对应文件或数据块,减少
I/O。
+- **[NGram BloomFilter 索引](./ngram-bloomfilter-index.md)**:加速文本 `LIKE` 查询。原理与
BloomFilter 索引类似,但存入的是文本的 NGram 分词,而非完整取值。`LIKE` 查询时,Doris 对 pattern
做同样的分词,逐个检查 token 是否在过滤器中。只要有一个 token 缺失,该文件或数据块就无法匹配,Doris 直接跳过。
### 场景三:文本全文检索(倒排索引)
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/index/index-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/index/index-overview.md
index b5297253f4b..5ee231e1b42 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/index/index-overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/table-design/index/index-overview.md
@@ -42,11 +42,11 @@ Apache Doris 针对不同查询场景提供四类索引:**点查索引**、**
Apache Doris 提供两种点查索引:
-- **[前缀索引](./prefix-index.md)**:Apache Doris 按照排序键以有序方式存储数据,并每隔 1024
行创建一个稀疏前缀索引。索引中的 Key 是当前 1024 行中第一行的排序列值。当查询涉及已排序列时,系统会找到相关的 1024
行组的第一行并从那里开始扫描。
-- **[倒排索引](./inverted-index/overview.md)**:对建立倒排索引的列,构建每个值到对应行号集合的倒排表。等值查询时,先从倒排表中查到行号集合,然后直接读取对应行的数据,避免逐行扫描,从而减少 I/O 加速查询。倒排索引还能加速范围过滤、文本关键词匹配,算法更加复杂但基本原理类似。
+- **[前缀索引](./prefix-index.md)**:Doris 按排序键有序存储数据,并每隔 1024
行创建一个稀疏前缀索引。每个索引项记录所在组第一行的排序列值。当查询按排序列过滤时,Doris 定位到对应的行组并从那里开始扫描。
+- **[倒排索引](./inverted-index/overview.md)**:Doris
构建一张倒排表,把每个值映射到包含它的行。等值查询时,先查到匹配的行,再只读取这些行,避免全表扫描并减少
I/O。倒排索引也能加速范围过滤和文本关键词检索;算法更复杂,但思路相同。
:::note
-之前的 BITMAP 索引已被功能更强的倒排索引取代。
+BITMAP 索引已被倒排索引取代。
:::
### 场景二:满足条件的行较多(跳数索引)
@@ -59,9 +59,9 @@ Apache Doris 提供两种点查索引:
Apache Doris 提供三种跳数索引:
-- **ZoneMap 索引**:自动维护每一列的统计信息,为每一个数据文件(Segment)和数据块(Page)记录最大值、最小值、是否有
NULL。对于等值查询、范围查询、IS NULL,可以通过最大值、最小值、是否有 NULL
判断数据文件和数据块是否可能包含满足条件的数据,如果不包含则跳过对应文件或数据块,减少 I/O 加速查询。
-- **[BloomFilter 索引](./bloomfilter.md)**:将索引列的可能取值存入 BloomFilter
数据结构。BloomFilter 可以快速判断一个值是否存在,且存储空间占用很低。对于等值查询,如果判断该值不在 BloomFilter
中,就可以跳过对应的数据文件或数据块,减少 I/O 加速查询。
-- **[NGram BloomFilter 索引](./ngram-bloomfilter-index.md)**:用于加速文本 LIKE
查询。基本原理与 BloomFilter 索引类似,区别在于存入 BloomFilter 的不是原始文本值,而是对文本进行 NGram 分词后的每个词。对于
LIKE 查询,将 LIKE 的 pattern 也进行 NGram 分词,判断每个词是否在 BloomFilter
中,如果某个词不在则对应的数据文件或数据块就不满足 LIKE 条件,可以跳过这部分数据,减少 I/O 加速查询。
+- **ZoneMap 索引**:Doris 自动为每一列维护统计信息(最小值、最大值、是否有
NULL),按数据文件(Segment)和数据块(Page)记录。对于等值、范围、IS NULL 过滤,Doris
用这些统计判断某个文件或数据块是否可能包含匹配行。若不可能,则跳过该文件或数据块,减少 I/O。
+- **[BloomFilter 索引](./bloomfilter.md)**:Doris 将列的取值存入
BloomFilter,这种结构能以很小的存储判断某个值是否存在。等值查询时,若该值不在 BloomFilter 中,Doris 跳过对应文件或数据块,减少
I/O。
+- **[NGram BloomFilter 索引](./ngram-bloomfilter-index.md)**:加速文本 `LIKE` 查询。原理与
BloomFilter 索引类似,但存入的是文本的 NGram 分词,而非完整取值。`LIKE` 查询时,Doris 对 pattern
做同样的分词,逐个检查 token 是否在过滤器中。只要有一个 token 缺失,该文件或数据块就无法匹配,Doris 直接跳过。
### 场景三:文本全文检索(倒排索引)
diff --git a/versioned_docs/version-4.x/table-design/index/index-overview.md
b/versioned_docs/version-4.x/table-design/index/index-overview.md
index 0a175abd0a2..205889dc1f2 100644
--- a/versioned_docs/version-4.x/table-design/index/index-overview.md
+++ b/versioned_docs/version-4.x/table-design/index/index-overview.md
@@ -42,11 +42,11 @@ Apache Doris provides four categories of indexes for
different query scenarios:
Apache Doris provides two point-query indexes:
-- **[Prefix index](./prefix-index.md)**: Apache Doris stores data in order
according to the sort key and creates a sparse prefix index every 1024 rows.
The Key in the index is the value of the sort columns of the first row in the
current 1024-row group. When a query involves the sorted columns, the system
finds the first row of the relevant 1024-row group and starts scanning from
there.
-- **[Inverted index](./inverted-index/overview.md)**: For columns with an
inverted index, Apache Doris builds an inverted table that maps each value to
the corresponding set of row numbers. For equality queries, it first looks up
the row number set from the inverted table and then directly reads the data of
those rows, avoiding a row-by-row scan and reducing I/O to accelerate the
query. Inverted indexes can also accelerate range filtering and text keyword
matching. The algorithms are mor [...]
+- **[Prefix index](./prefix-index.md)**: Doris stores data sorted by the sort
key and builds a sparse prefix index every 1024 rows. Each index entry holds
the sort-column values of the first row in its group. When a query filters on
the sort columns, Doris jumps to the right group and scans from there.
+- **[Inverted index](./inverted-index/overview.md)**: Doris builds an inverted
table that maps each value to the rows that contain it. For an equality query,
it looks up the matching rows and reads only those, which avoids a full scan
and cuts I/O. Inverted indexes also speed up range filters and text keyword
search; the algorithm is more involved, but the idea is the same.
:::note
-The previous BITMAP index has been replaced by the more powerful inverted
index.
+The BITMAP index has been replaced by the inverted index.
:::
### Scenario 2: Many Rows Match the Condition (Skip Indexes)
@@ -59,9 +59,9 @@ The previous BITMAP index has been replaced by the more
powerful inverted index.
Apache Doris provides three skip indexes:
-- **ZoneMap index**: Automatically maintains statistics for each column,
recording the maximum value, minimum value, and whether NULL exists for each
data file (Segment) and data block (Page). For equality queries, range queries,
and IS NULL, the maximum value, minimum value, and the presence of NULL can be
used to determine whether a data file or data block may contain rows that
satisfy the condition. If not, the corresponding file or block is skipped,
reducing I/O and accelerating the query.
-- **[BloomFilter index](./bloomfilter.md)**: Stores the possible values of the
indexed column in a BloomFilter data structure. A BloomFilter can quickly
determine whether a value exists, with very low storage overhead. For equality
queries, if the value is determined to be absent from the BloomFilter, the
corresponding data file or data block can be skipped, reducing I/O and
accelerating the query.
-- **[NGram BloomFilter index](./ngram-bloomfilter-index.md)**: Used to
accelerate text LIKE queries. The basic principle is similar to the BloomFilter
index. The difference is that what is stored in the BloomFilter is not the
original text value but each token produced by NGram tokenization of the text.
For LIKE queries, the LIKE pattern is also tokenized with NGram, and each token
is checked against the BloomFilter. If any token is absent, the corresponding
data file or data block does [...]
+- **ZoneMap index**: Doris automatically keeps per-column statistics (min,
max, and whether NULLs exist) for each data file (Segment) and data block
(Page). For equality, range, and IS NULL filters, it uses these stats to decide
whether a file or block can contain matching rows. If it can't, Doris skips
that file or block and cuts I/O.
+- **[BloomFilter index](./bloomfilter.md)**: Doris stores the column's values
in a BloomFilter, a structure that tells you whether a value is present with
very little storage. For an equality query, if the value isn't in the filter,
Doris skips that file or block and cuts I/O.
+- **[NGram BloomFilter index](./ngram-bloomfilter-index.md)**: Speeds up text
`LIKE` queries. It works like the BloomFilter index, but stores NGram tokens of
the text instead of whole values. For a `LIKE` query, Doris tokenizes the
pattern the same way and checks each token against the filter. If any token is
missing, the file or block can't match, so Doris skips it.
### Scenario 3: Full-Text Search on Text (Inverted Index)
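Note on the Scenario 2 text in the hunks above: the three skip indexes share one pattern, which is to keep a small summary per file or block and skip the block when the summary proves no row can match. The Python below is a rough sketch of that pattern under stated assumptions. The class and function names, the SHA-256-based hashing, the bit-array size, and the trigram length are all hypothetical choices for illustration and say nothing about how Doris implements these indexes.

```python
import hashlib


class ZoneMap:
    """Per-block min, max, and has-NULL statistics (None stands in for NULL)."""

    def __init__(self, values):
        non_null = [v for v in values if v is not None]
        self.min = min(non_null) if non_null else None
        self.max = max(non_null) if non_null else None
        self.has_null = len(non_null) < len(values)

    def may_contain_eq(self, target):
        return self.min is not None and self.min <= target <= self.max

    def may_contain_range(self, low, high):
        # The block range [min, max] must overlap the query range [low, high].
        return self.min is not None and self.min <= high and low <= self.max

    def may_contain_null(self):
        return self.has_null


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def ngrams(text, n=3):
    """All contiguous substrings of length n (empty if the text is shorter)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_filter(texts, n=3):
    bloom = BloomFilter()
    for text in texts:
        for gram in ngrams(text, n):
            bloom.add(gram)
    return bloom


def block_may_match_like(bloom, literal, n=3):
    """literal is the text between the wildcards in LIKE '%literal%'."""
    # If any n-gram of the literal is absent, no row in the block can match.
    return all(bloom.might_contain(gram) for gram in ngrams(literal, n))


zone = ZoneMap([5, 9, None, 7])
ids = BloomFilter()
for value in (101, 202, 303):
    ids.add(value)
text_filter = build_ngram_filter(["hello world", "index overview"])
```

A `False` answer from any of these checks means the block is safe to skip. A `True` answer only means the block has to be read, since min/max ranges and Bloom filters can both report a possible match where none exists. A literal shorter than `n` yields no n-grams, so `block_may_match_like` conservatively returns `True`.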